Abstract

In exploratory data analysis, we are often interested in identifying promising pairwise associations for 
further analysis while filtering out weaker, less interesting ones. This can be accomplished by computing 
a measure of dependence on all possible variable pairs and examining the highest-scoring pairs, provided 
the measure of dependence used assigns similar scores to equally noisy relationships of different types. 
This property, called equitability, is formalized in Reshef et al. [2015b]. In addition to equitability,
measures of dependence can also be assessed by the power of their corresponding independence tests as 
well as their runtime. 

Here we present extensive empirical evaluation of the equitability, power against independence, and 
runtime of several leading measures of dependence. These include two statistics newly introduced in 
Reshef et al. [2015a]: MIC_e, which has equitability as its primary goal, and TIC_e, which has power
against independence as its primary goal.

Regarding equitability, our analysis finds that MIC_e is the most equitable method on functional
relationships in most of the settings we considered, although mutual information estimation proves the
most equitable at large sample sizes in some specific settings. Regarding power against independence,
we find that TIC_e, along with Heller and Gorfine's S^DDP, is the state of the art on the relationships
we tested. Our analyses also show evidence for a trade-off between power against independence and
equitability consistent with the theory in Reshef et al. [2015b]. In terms of runtime, MIC_e and TIC_e
are significantly faster than many other measures of dependence tested. Moreover, computing either
one makes computing the other trivial. This suggests that a fast and useful strategy for achieving a
combination of power against independence and equitability may be to filter relationships by TIC_e and
then to examine the MIC_e of only the significant ones.

We conclude with a discussion of the settings in which MIC_e and TIC_e are (and are not) appropriate
tools. It is our hope that this work provides a practical guide for the use of MIC_e, TIC_e, and related
statistics, and for the role of equitability more generally.


1 Introduction 

Suppose we have a high-dimensional data set with hundreds or thousands of dimensions and we wish to 
find interesting associations within it to analyze further. Even if we only search for pairwise associations 
among the variables, the number of potential relationships to examine is unmanageably large, necessitating 
automation to assist in the search. In this context, a common, simple approach is to compute some statistic 
on each combination of variables, rank the variable pairs from highest- to lowest-scoring, and then examine 
a small number of the top-scoring variable pairs in the resulting list. 

The success of this strategy depends heavily on the statistic used. One natural approach is to use a 
measure of dependence, that is, a statistic whose population value is zero when the variables in question are 
statistically independent and non-zero otherwise. However, this is not sufficient to guarantee success. To 
see this, imagine using such a statistic $\hat{\varphi}$ on a data set containing many noisy linear relationships as well as
a smaller number of strong sinusoidal relationships. The fact that $\hat{\varphi}$ is a measure of dependence guarantees
us that, given sufficient sample size, all of these relationships will receive non-trivial scores. Unfortunately,
it tells us nothing about how those non-trivial scores will compare to each other. For example, $\hat{\varphi}$
could systematically assign higher scores to linear relationships than to sinusoidal relationships. If that is
the case, then when we rank relationships by $\hat{\varphi}$, the noisy linear relationships may crowd out the sinusoidal
relationships from the top of the list. Since we can only manually examine a relatively small number of
relationships from the top of the list, we may therefore miss the sinusoidal relationships even though they
are strong.
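The crowding-out effect described above can be reproduced in a few lines. The following is a minimal sketch, not part of the analyses in this paper: the helper names `rank_pairs` and `abs_pearson`, the toy data set, and the use of the absolute Pearson correlation as the ranking statistic are all assumptions made purely for illustration.

```python
import itertools
import numpy as np

def rank_pairs(data, statistic, top_k):
    """Score every pair of columns and return the top_k as (score, i, j) tuples."""
    pairs = itertools.combinations(range(data.shape[1]), 2)
    scored = [(statistic(data[:, i], data[:, j]), i, j) for i, j in pairs]
    return sorted(scored, reverse=True)[:top_k]

def abs_pearson(x, y):
    # Stand-in ranking statistic (an assumption of this sketch).
    return abs(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)
data = np.column_stack([
    x,                                   # column 0
    x + 0.5 * rng.normal(size=500),      # column 1: noisy linear function of column 0
    np.sin(4.0 * np.pi * x),             # column 2: noiseless sinusoid of column 0
    rng.normal(size=500),                # column 3: independent noise
])

top = rank_pairs(data, abs_pearson, top_k=2)
all_scores = {(i, j): s for s, i, j in rank_pairs(data, abs_pearson, top_k=6)}
```

In this toy example the noisy linear pair (columns 0 and 1) is ranked first, while the noiseless sinusoidal pair (columns 0 and 2) receives a much lower score, even though it is the stronger relationship in the sense of having no noise at all.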

If our goal were simply to detect as many relationships as possible, then the measure of dependence
$\hat{\varphi}$ would perform well to the extent that its associated independence test has good power. But a
high-dimensional data set may contain a very large number of non-trivial relationships, some strong and others
weak, and a list of all of them may be too large to allow for manual follow-up of each identified relationship
[Reshef et al., 2015b; Emilsson et al., 2008]. Thus, in the exploration of large data sets, our goal is often not
only to detect as many of the non-trivial associations in the data set as possible, but also to rank them by
some notion of strength. For this task, deviation from independence can be too weak a search criterion.

One framework to address this challenge utilizes a property called equitability. Loosely, an equitable 
measure of dependence is one that gives similar scores to equally noisy relationships of different types 
[Reshef et al., 2011]. This definition is formalized in Reshef et al. [2015b] and shown there to be equivalent 
to power against a range of null hypotheses corresponding to different relationship strengths rather than 
the single null hypothesis of statistical independence (i.e., zero relationship strength). While the general 
concept of equitability is quite broad, one intuitive and natural instantiation is that, when used on functional 
relationships, the value of an equitable measure of dependence should reflect the coefficient of determination 
($R^2$) with respect to the generating function with as weak a dependence as possible on the particular function
in question. 
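To illustrate what it means for a score to reflect $R^2$ with respect to the generating function, the sketch below builds a linear and a parabolic relationship with approximately the same $R^2$ and compares that value with the squared Pearson correlation between x and y. This is an illustrative sketch only: we assume here that $R^2$ is computed as the squared Pearson correlation between the observed y and the noiseless values f(x), and the function name `r2_wrt_generating_function`, the two functions, and the target value 0.7 are hypothetical choices.

```python
import numpy as np

def r2_wrt_generating_function(fx, y):
    """Squared Pearson correlation between the noiseless values f(x) and y."""
    return float(np.corrcoef(fx, y)[0, 1] ** 2)

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=20000)
functions = {
    "linear": lambda t: t,
    "parabolic": lambda t: 4.0 * (t - 0.5) ** 2,
}

results = {}
for name, f in functions.items():
    fx = f(x)
    # Choose the noise level so that R^2 is approximately 0.7 for every function.
    noise_sd = np.sqrt(fx.var() * (1.0 - 0.7) / 0.7)
    y = fx + noise_sd * rng.normal(size=x.size)
    r2 = r2_wrt_generating_function(fx, y)
    pearson_sq = float(np.corrcoef(x, y)[0, 1] ** 2)
    results[name] = (r2, pearson_sq)
```

Both relationships are equally noisy by construction, yet the squared Pearson correlation is close to the $R^2$ only for the linear one and is close to zero for the parabolic one. An equitable statistic would instead assign the two relationships similar scores.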

Equitability is a difficult property to achieve, and most measures of dependence do not have high
equitability on functional relationships. (This is understandable, as they are not designed with that goal in
mind.) One statistic that has shown good equitability on functional relationships is the maximal information
coefficient (MIC) [Reshef et al., 2011]. In Reshef et al. [2015a] a new, efficiently computable, consistent
estimator of the population MIC, called MIC_e, is introduced, along with a related measure of dependence
called the total information coefficient (TIC_e), which is essentially free to compute when MIC_e is computed.

In this paper, we demonstrate how the theoretical advances of Reshef et al. [2015a,b] translate into practical
benefits via extensive empirical analyses, under a wide range of settings, of the equitability, power, and
runtime of MIC_e, TIC_e, and several leading measures of dependence: MIC [Reshef et al., 2011], distance
correlation [Székely and Rizzo, 2009; Székely et al., 2007], mutual information estimation [Kraskov et al.,
2004], maximal correlation [Rényi, 1959; Breiman and Friedman, 1985], the randomized dependence
coefficient (RDC) [Lopez-Paz et al., 2013], the Heller-Heller-Gorfine distance (HHG) [Heller et al., 2013], S^DDP
[Heller et al., 2014], and the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005, 2008,
2012]. Throughout our analyses, we show how the theoretical framework of Reshef et al. [2015b] can be used
to rigorously quantify equitability in practice.

Our analyses yield four main conclusions. First, with regard to equitability, they show that estimation
of the population MIC via MIC_e is more equitable than other methods across the majority (32 out of 36) of
the settings of noise/marginal distributions and sample size that we tested. (In the remaining four settings,
the Kraskov mutual information estimator outperforms MIC_e.)

The second conclusion we draw is that the total information coefficient TIC_e achieves overall statistical
power against independence that is state-of-the-art. State-of-the-art power against independence is also
achieved by Heller and Gorfine's S^DDP, which outperforms TIC_e by some metrics and is outperformed
by TIC_e in others. The power of TIC_e is high not just overall, but also on each individual alternative
hypothesis relationship type we examined, meaning that we did not identify any one relationship type that
TIC_e is especially poorly suited for detecting.

The third conclusion is that the power against independence of MIC_e, the new estimator of the population
MIC, is competitive with other state-of-the-art techniques, albeit with a different setting of its parameter $\alpha$
than the one that confers good equitability. This observation leads us to characterize a power-equitability
trade-off that is captured by this parameter and appears consistent with the theory of equitability developed
in Reshef et al. [2015b] together with “no free lunch” considerations.

Our final conclusion concerns runtime. We find that MIC_e and TIC_e are as fast as or faster than most
other methods tested. Even at a sample size of n = 5,000, running MIC_e/TIC_e on all variable pairs in a
1,000-variable data set using a 100-node cluster, with parameters that yield state-of-the-art power against
independence and near-optimal equitability, takes just 8.1 minutes. Moreover, once either MIC_e or TIC_e is
computed, the other can be computed trivially.

Taken together, our results suggest that MIC_e can be efficiently used in conjunction with TIC_e to achieve
a useful mix of power against independence (by filtering results using TIC_e) and equitability (by using MIC_e
on the remaining variable pairs) when exploring a data set.

Together, this paper, Reshef et al. [2015b], and Reshef et al. [2015a] have three primary objectives. The 
first is to formalize the theory behind both equitability and the maximal information coefficient. The second 
is to introduce and analyze a new estimator of the population MIC as well as a new measure of dependence 
called the total information coefficient. The third is to provide an extensive comparison of the performance 
of a set of state-of-the-art measures of dependence in a wide range of settings in terms of equitability, power 
against independence, and runtime. While this paper is focused primarily on the performance comparison, 
providing direct and in-depth comparisons to existing methods, we hope these papers together expand the 
use of both this framework for data analysis and the existing algorithms. 

The rest of this paper is organized as follows. In Section 2 we cover preliminaries, in Section 3 we 
give a brief review of equitability, in Section 4 we analyze the equitability of the methods in question, in 
Section 5 we analyze their power against independence, in Section 6 we characterize the tradeoff between 
power against independence and equitability, in Section 7 we analyze runtime, and in Section 8 we offer a 
concluding discussion. 

2 Preliminaries 

As we extensively analyze several statistics introduced in Reshef et al. [2011] and Reshef et al. [2015a], we 
start by reviewing the definitions of those statistics and related objects. The informed reader may skip this 
section and refer to it as needed. 

2.1 Overview and notation 

The statistics we present here are two estimators of the maximal information coefficient, as well as the 
total information coefficient. For all of these statistics, we have a sample from the distribution of some 
two-dimensional random variable (X, Y). The goal in estimating the maximal information coefficient is to 
provide a score in the form of a number between 0 and 1 that quantifies the strength of the relationship 
between X and Y in an equitable way (see Section 3 for a review of equitability). The goal in computing 
the total information coefficient is to obtain a statistic for testing for the presence or absence of statistical 
independence between X and Y. 

For all statistics, we use the following notational conventions. Let G be a finite grid drawn on the 
Euclidean plane. Given a point $(x, y) \in \mathbb{R}^2$, we define the function $\mathrm{row}_G(y)$ to be the row of G containing
y and we define $\mathrm{col}_G(x)$ analogously. For a pair (X, Y) of jointly distributed random variables, we write
$(X, Y)|_G$ to denote the discrete random variable $(\mathrm{col}_G(X), \mathrm{row}_G(Y))$. For natural numbers k and $\ell$, we use
$G(k, \ell)$ to denote the set of all k-by-$\ell$ grids (possibly with empty rows/columns). Given a finite sample D
from the distribution of (X, Y), we use D to refer both to the set of points in the sample as well as to a
point chosen uniformly at random from D. In the latter case, it then makes sense to talk about, e.g., $D|_G$
and $I(D|_G)$.
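The notation above can be made concrete with a short sketch that maps a sample D onto the cells of a grid G and computes the mutual information $I(D|_G)$ of the resulting discrete distribution. The function names and the representation of a grid by its interior column and row boundaries are assumptions of this sketch, which is meant only to illustrate the definitions.

```python
import numpy as np

def grid_cells(points, col_edges, row_edges):
    """Map each point (x, y) to (col_G(x), row_G(y)) for the grid G given by its interior edges."""
    cols = np.searchsorted(col_edges, points[:, 0], side="right")
    rows = np.searchsorted(row_edges, points[:, 1], side="right")
    return cols, rows

def grid_mutual_information(points, col_edges, row_edges):
    """I(D|_G) in nats: mutual information of the empirical distribution over grid cells."""
    cols, rows = grid_cells(points, col_edges, row_edges)
    counts = np.zeros((len(col_edges) + 1, len(row_edges) + 1))
    np.add.at(counts, (cols, rows), 1)          # count the points falling in each cell
    joint = counts / counts.sum()
    outer = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    mask = joint > 0                            # empty cells contribute nothing
    return float((joint[mask] * np.log(joint[mask] / outer[mask])).sum())

# A 2-by-2 grid with both boundaries at 0.5.
diagonal = np.array([[0.1, 0.1], [0.2, 0.2], [0.8, 0.8], [0.9, 0.9]])
corners = np.array([[0.1, 0.1], [0.1, 0.9], [0.9, 0.1], [0.9, 0.9]])
mi_diagonal = grid_mutual_information(diagonal, [0.5], [0.5])
mi_corners = grid_mutual_information(corners, [0.5], [0.5])
```

For the diagonal sample, the column of a point determines its row, so the mutual information equals log 2; for the corner sample, rows and columns are empirically independent and the mutual information is zero.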

2.2 The maximal information coefficient 

The maximal information coefficient (MIC) is a statistic introduced in Reshef et al. [2011] as a way to achieve 
good equitability on a wide range of relationship types. In Reshef et al. [2015a], the population value of this 
statistic is computed and a new estimator of that population value is given. Here we define all three of these
objects.

2.2.1 The population MIC


We begin by defining the population value of MIC, which we denote by MIC*. To define this quantity, we 
must first define an object called the population characteristic matrix. The population MIC will then be the 
supremum of this matrix. 

Definition 2.1 (Reshef et al. [2015a]). Let (X, Y) be jointly distributed random variables. Let

\[ I^*((X,Y), k, \ell) = \max_{G \in G(k,\ell)} I\big((X,Y)|_G\big), \]

where I represents the mutual information. The population characteristic matrix of (X, Y), denoted by
M(X, Y), is defined by

\[ M(X,Y)_{k,\ell} = \frac{I^*((X,Y), k, \ell)}{\log \min\{k, \ell\}} \]

for $k, \ell > 1$.


For more on mutual information see, e.g., Cover and Thomas [2006] and Csiszár and Shields [2004].
The characteristic matrix is so named because in Reshef et al. [2011] it was hypothesized that this matrix 
takes on different “shapes” that are characteristic of different relationship types, so that different properties of 
the matrix may correspond to different properties of relationships. One such property was the maximal value 
of the matrix. This is called the maximal information coefficient (MIC), and its corresponding population 
quantity is defined below. 

Definition 2.2 (Reshef et al. [2015a]). Let (X, Y) be jointly distributed random variables. The population
maximal information coefficient (MIC*) of (X, Y) is defined by

\[ \mathrm{MIC}^*(X, Y) = \sup M(X, Y). \]


The population MIC has several alternate characterizations, both as a canonical smoothing of mutual 
information and as the supremum of the boundary of the characteristic matrix. For more, see Reshef et al. 
[2015a]. 


2.2.2 Estimators of MIC* 

In this work we study two different estimators of the population MIC. 


The first estimator: MIC. The first statistic we analyze is the original statistic introduced in Reshef et al.
[2011], which estimates MIC* by first estimating each entry of the characteristic matrix up to a
sample-size-dependent maximal grid resolution. This estimated characteristic matrix is called the sample characteristic
matrix and is defined below.


Definition 2.3 (Reshef et al. [2011]). Let $D \subset \mathbb{R}^2$ be a set of ordered pairs. The sample characteristic
matrix $\widehat{M}(D)$ of D is defined by

\[ \widehat{M}(D)_{k,\ell} = \frac{I^*(D, k, \ell)}{\log \min\{k, \ell\}}. \]


MIC is then the maximum of the sample characteristic matrix, subject to a sample size-dependent limit 
on the maximal allowed grid resolution. 


Definition 2.4 (Reshef et al. [2011]). Let $D \subset \mathbb{R}^2$ be a set of n ordered pairs, and let $B : \mathbb{Z}^+ \to \mathbb{Z}^+$. We
define

\[ \mathrm{MIC}_B(D) = \max_{k\ell < B(n)} \widehat{M}(D)_{k,\ell}. \]


The statistic MIC is proven in Reshef et al. [2015a] to be a consistent estimator of the population MIC,
provided $\omega(1) < B(n) < O(n^{1-\varepsilon})$ for $\varepsilon > 0$. However, it is not known how to efficiently compute the exact
value of MIC, and so in practice a heuristic dynamic-programming approximation algorithm is used.

The second estimator: MIC_e. The second statistic we analyze is MIC_e, a statistic introduced in Reshef
et al. [2015a] and proven there to be a consistent estimator of MIC*. In contrast to MIC, it is known how
to compute MIC_e exactly in polynomial time (although in practice other, still more efficient statistics may
nevertheless be used; see below). Rather than attempting to estimate any entries of the characteristic matrix,
MIC_e estimates a different matrix, the equicharacteristic matrix, whose supremum is the same as that of the
characteristic matrix. Estimates of entries of this other matrix turn out to be both much easier to compute
and sufficient for estimating MIC*.

We first define the sample equicharacteristic matrix, along with a prerequisite definition. 

Definition 2.5 (Reshef et al. [2015a]). Let (X, Y) be a pair of jointly distributed random variables. Define

\[ I^*((X,Y), k, [\ell]) = \max_{G \in G(k,[\ell])} I\big((X,Y)|_G\big), \]

where $G(k, [\ell])$ is the set of k-by-$\ell$ grids whose y-axis partition is an equipartition of size $\ell$. Define
$I^*((X,Y), [k], \ell)$ analogously.

Define $I^{[*]}((X,Y), k, \ell)$ to equal $I^*((X,Y), k, [\ell])$ if $k < \ell$ and $I^*((X,Y), [k], \ell)$ otherwise.

Definition 2.6 (Reshef et al. [2015a]). Let $D \subset \mathbb{R}^2$ be a set of ordered pairs. The sample equicharacteristic
matrix $[\widehat{M}](D)$ of D is defined by

\[ [\widehat{M}](D)_{k,\ell} = \frac{I^{[*]}(D, k, \ell)}{\log \min\{k, \ell\}}. \]

We can now define the second estimator, MIC_e.

Definition 2.7 (Reshef et al. [2015a]). Let $D \subset \mathbb{R}^2$ be a set of n ordered pairs, and let $B : \mathbb{Z}^+ \to \mathbb{Z}^+$. We
define

\[ \mathrm{MIC}_{e,B}(D) = \max_{k\ell < B(n)} [\widehat{M}](D)_{k,\ell}. \]

MIC_e can be computed using dynamic programming, resulting in a search procedure that takes time
$O(n^2 B(n)^2)$, which equals $O(n^{2+2\alpha})$ when $B(n) = n^\alpha$. In practice, however, this algorithm can be modified
to include a parameter c that controls the coarseness of the discretization of the grid-maximization search.
The modified statistic remains a consistent estimator of MIC* and runs in time $O(c^2 B(n)^{5/2}) = O(c^2 n^{5\alpha/2})$
[Reshef et al., 2015a]. In this work we use MIC_e to refer both to the statistic as defined above and to the
result of this modified algorithm. For more, see Reshef et al. [2015a].
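To give a feel for the shape of the matrix involved, the sketch below computes a simplified stand-in for the sample equicharacteristic matrix. It is explicitly not the MIC_e algorithm of Reshef et al. [2015a]: whereas each entry of the equicharacteristic matrix equipartitions one axis and maximizes over partitions of the other, this sketch equipartitions both axes, so that (absent ties) each entry is only a lower bound on the corresponding true entry. The function names and the default value of `alpha` are assumptions made for illustration.

```python
import numpy as np

def equipartition_labels(values, n_bins):
    """Label each value with one of n_bins bins of (nearly) equal size; ties broken by order."""
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return ranks * n_bins // len(values)

def normalized_grid_mi(col_labels, row_labels, k, l):
    """Mutual information (nats) of the k-by-l cell counts, divided by log min{k, l}."""
    counts = np.zeros((k, l))
    np.add.at(counts, (col_labels, row_labels), 1)
    joint = counts / counts.sum()
    outer = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    mi = (joint[mask] * np.log(joint[mask] / outer[mask])).sum()
    return float(mi / np.log(min(k, l)))

def doubly_equipartitioned_matrix(x, y, alpha=0.6):
    """Entries for all k, l >= 2 with k * l < B(n) = n ** alpha (both axes equipartitioned)."""
    budget = len(x) ** alpha
    entries = {}
    k = 2
    while 2 * k < budget:
        col_labels = equipartition_labels(x, k)
        l = 2
        while k * l < budget:
            row_labels = equipartition_labels(y, l)
            entries[(k, l)] = normalized_grid_mi(col_labels, row_labels, k, l)
            l += 1
        k += 1
    return entries

rng = np.random.default_rng(5)
x = rng.uniform(size=300)
dependent = doubly_equipartitioned_matrix(x, x)                       # noiseless y = x
independent = doubly_equipartitioned_matrix(x, rng.uniform(size=300)) # independent y
```

Taking the maximum of `dependent` gives a value of essentially 1, attained already on the 2-by-2 grid, while the maximum of `independent` is small but positive.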

2.3 The total information coefficient 

While the maximal information coefficient aims to measure the strength of a relationship equitably, the total 
information coefficient (TIC), introduced in Reshef et al. [2015a], provides a way of testing for the presence 
or absence of statistical independence with good power and is a trivial side-product of the computation of 
the maximal information coefficient. 

The intuition behind the total information coefficient is that while estimating MIC* has many advantages,
this estimation involves taking a maximum over many estimates of entries of the characteristic or
equicharacteristic matrix. Since the maximum of a set of random variables tends to become large as the 
number of variables grows, one can imagine that this procedure can lead to an unwanted positive bias in 
the case of statistical independence, when the population characteristic matrix equals 0, and a consequent 
reduction in power against independence. 
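The positive bias of a maximum can be seen in a small simulation. The sketch below is an assumption-laden toy model: the "matrix entries" under independence are stood in for by i.i.d. half-normal noise with an arbitrary scale, rather than by actual mutual information estimates, and the function name `mean_of_maximum` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_of_maximum(n_entries, n_trials=5000, scale=0.05):
    """Average, over trials, of the largest of n_entries i.i.d. null estimates."""
    # Stand-in null estimates (an assumption): non-negative noise around a true value of 0.
    null_estimates = np.abs(rng.normal(scale=scale, size=(n_trials, n_entries)))
    return float(null_estimates.max(axis=1).mean())

bias_by_size = {m: mean_of_maximum(m) for m in (1, 10, 100)}
```

Although every individual entry estimates a population value of 0, the expected maximum grows as the number of entries grows, which is the effect that motivates replacing the maximum by a sum.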

To circumvent this problem, the total information coefficient is not the maximum but the sum of the 
entries of the matrix. Since this property of the matrix has better statistical properties, we might expect it to 
have a smaller bias in the case of statistical independence and therefore better power. Stated alternatively, 
if our only goal is to distinguish any dependence at all from complete noise, then disregarding all of the 
sample characteristic/equicharacteristic matrix except for its maximal value throws away useful signal, and 
the total information coefficient avoids this by summing all the entries.

2.3.1 The statistic TIC_e

The version of the total information coefficient studied in this work is analogous to the statistic MIC_e
presented above in that it proceeds via summation not of the sample characteristic matrix $\widehat{M}$ but rather of
the sample equicharacteristic matrix $[\widehat{M}]$.

Definition 2.8. Let $D \subset \mathbb{R}^2$ be a set of n ordered pairs. Given a function $B : \mathbb{Z}^+ \to \mathbb{Z}^+$, we define
$\mathrm{TIC}_{e,B}(D)$ to be

\[ \mathrm{TIC}_{e,B}(D) = \sum_{k\ell < B(n)} [\widehat{M}](D)_{k,\ell}, \]

where $[\widehat{M}](D)$ is the sample equicharacteristic matrix.

In Reshef et al. [2015a] it is proven that TIC_e yields a consistent right-tailed independence test, provided
$\omega(1) < B(n) < O(n^{1-\varepsilon})$ for $\varepsilon > 0$. As with MIC_e, there is an additional parameter c that controls the
coarseness of the discretization of the grid search when TIC_e is computed. However, this does not affect the
consistency of the corresponding independence test. See Reshef et al. [2015a] for more detail.
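The sense in which one statistic is a trivial side-product of the other can be sketched as follows: once a matrix of entries indexed by grid dimensions is available, the maximum and the sum over the entries with $k\ell < B(n)$ are each a single reduction. The function name `max_and_sum`, the dense array layout (with rows and columns 0 and 1 unused), and the toy values are assumptions of this sketch.

```python
import numpy as np

def max_and_sum(matrix, budget):
    """Return (max, sum) over entries matrix[k, l] with k, l >= 2 and k * l < budget."""
    ks, ls = np.meshgrid(
        np.arange(matrix.shape[0]), np.arange(matrix.shape[1]), indexing="ij"
    )
    allowed = (ks >= 2) & (ls >= 2) & (ks * ls < budget)
    values = matrix[allowed]            # assumes at least one allowed entry
    return float(values.max()), float(values.sum())

# Toy matrix with hypothetical entries; index [k, l] refers to a k-by-l grid.
toy = np.zeros((5, 5))
toy[2, 2], toy[2, 3], toy[3, 2] = 0.5, 0.2, 0.1
toy[2, 4], toy[4, 2], toy[3, 3] = 0.3, 0.4, 0.9
mic_like, tic_like = max_and_sum(toy, budget=9)
```

With a budget of 9, the 3-by-3 entry is excluded because its product of dimensions is not strictly below the budget, so the maximum is 0.5 and the sum is 1.5.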

2.4 Summary of MIC and TIC-related statistics 

Table 1 lists the objects discussed in this section. 


Object   Description                                       Defined in
MIC      Statistic for quantifying relationship strength   Reshef et al. [2011]
MIC*     Population value of MIC                           Reshef et al. [2015a]
MIC_e    Estimator of MIC* via equicharacteristic matrix   Reshef et al. [2015a]
TIC_e    Statistic for testing for independence            Reshef et al. [2015a]


Table 1: Statistics and estimands related to the maximal and total information coefficients. 


3 A review of equitability 

Equitability is a property of measures of dependence introduced in Reshef et al. [2011] and formalized in 
Reshef et al. [2015b] that is particularly useful in the context of data exploration. Because this paper 
analyzes the equitability of several leading measures of dependence, we first present here a review of the 
basic definitions of, and results about, equitability from Reshef et al. [2015b].

There are two different ways to view equitability, each with its corresponding intuition. The first states 
roughly that an equitable measure of dependence “give[s] similar scores to equally noisy relationships of 
different types” [Reshef et al., 2011]. In this viewpoint, a highly equitable measure of dependence allows us 
notionally to find the “strongest K” relationships in our data set for any K.

The second view of equitability is based on statistical power: an equitable measure of dependence provides
good tests for distinguishing between relationships with different, potentially non-zero amounts of noise. In
other words, instead of yielding tests that only reject a null hypothesis of independence (i.e., “relationship
strength = 0”), an equitable measure of dependence yields tests for rejecting null hypotheses of the form
“relationship strength $\le x_0$” for all possible $x_0$. That is, a highly equitable measure of dependence allows
us to find with high power all the relationships in our data set with “strength at least $x_0$” for any $x_0$.

These two viewpoints are formalized and shown to be equivalent in Reshef et al. [2015b]. We now 
summarize those formalizations as well as their equivalence, together with some examples and intuition. 

3.1 Defining equitability via power 

Let $\hat{\varphi}$ be some statistic. To be able to talk rigorously about the equitability of $\hat{\varphi}$, we must specify two things:
a set Q of distributions on which we can state what we mean by relationship strength, and a corresponding
function $\Phi : Q \to [0,1]$ that computes that strength. The set Q is called the set of standard relationships
and the function $\Phi$ is called the property of interest.

A natural setting to keep in mind is that Q is some diverse set of functional relationships with noise 
added and $\Phi$ is $R^2$, i.e., the coefficient of determination with respect to the generating function. We return
to this example often as a way to build intuition. 

We can now define equitability in terms of power against a broad class of null hypotheses.[1]

Definition 3.1. Let $\hat{\varphi}$ be a statistic, let Q be a set of standard relationships, let $\Phi : Q \to [0,1]$, and fix
some $0 < \alpha < 1/2$. The statistic $\hat{\varphi}$ is 1/d-equitable with respect to $\Phi$ with confidence $1 - 2\alpha$ if and only if
for every $x_0, x_1 \in [0,1]$ satisfying $x_1 - x_0 > d$, there exists a right-tailed level-$\alpha$ test based on $\hat{\varphi}$ that can
distinguish between $H_0 : \Phi(Z) \le x_0$ and $H_1 : \Phi(Z) \ge x_1$ with power at least $1 - \alpha$.

The smaller d is the better, and consequently the best equitability that can be achieved is when d = 0,
and the statistic in question is $\infty$-equitable. This is called perfect equitability, and is generally discussed as
a property of the population value of a statistic.

This definition of equitability is illustrated schematically in Figure 1. It implies that when $\Phi$ is 0 precisely
in cases of statistical independence, equitability can be viewed as a generalization of power against statistical
independence on Q. Specifically, when we set $x_0 = 0$, a statistic being 1/d-equitable means that the statistic
yields a test that has good power against independence on any alternative hypothesis as extreme or more
extreme than $H_1 : \Phi = d$. In general, the definition says that a 1/d-equitable statistic allows us, given some
threshold $x_0$ of relationship strength as measured by $\Phi$, to successfully identify all the relationships in a data
set with strength greater than $x_0 + d$. This may be important if our data set has many weak relationships
and a smaller number of strong relationships that we would like to find.

As the formalization just presented makes clear, an analysis of equitability must differ from conventional 
analyses of power against independence in two ways. First, statistical independence represents only one null 
hypothesis, in contrast to the many null hypotheses against which equitability requires good power. Second, 
since in the setting of equitability the model Q will contain multiple distinct classes of relationship types 
(e.g., linear, exponential, etc.), the null and alternative hypotheses that must be analyzed are composite. 
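The power-based view can be illustrated with a Monte Carlo estimate of the power of a right-tailed test at distinguishing one relationship strength from another. The sketch below is deliberately simplified and is not the procedure used in our analyses: Q contains only noisy linear relationships (so the null and alternative hypotheses are simple rather than composite), the statistic is the squared sample Pearson correlation, and the names `sample_statistic` and `power` as well as all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_statistic(r2, n, n_trials):
    """Squared sample Pearson correlation on noisy linear data with population R^2 = r2."""
    x = rng.normal(size=(n_trials, n))
    noise = rng.normal(size=(n_trials, n))
    y = np.sqrt(r2) * x + np.sqrt(1.0 - r2) * noise
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    r = (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))
    return r**2

def power(x0, x1, alpha=0.05, n=100, n_trials=2000):
    """Estimated power of a right-tailed level-alpha test of H0: R^2 = x0 against H1: R^2 = x1."""
    # Critical value: the (1 - alpha) quantile of the statistic under the null.
    critical_value = np.quantile(sample_statistic(x0, n, n_trials), 1.0 - alpha)
    return float((sample_statistic(x1, n, n_trials) > critical_value).mean())

power_at_null = power(0.3, 0.3)   # should be close to alpha
power_far = power(0.3, 0.7)       # alternative far from the null
```

Evaluating `power` over a grid of $(x_0, x_1)$ pairs would produce a power surface of the kind shown schematically in Figure 1c. An equitability analysis additionally requires Q to contain several relationship types, in which case the critical value must control the level over every null distribution with the given strength.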

3.2 Defining equitability via interpretability 

In addition to the view that defines equitability in terms of power, we can take an alternative approach that 
directly formalizes the intuition that an equitable statistic assigns similar scores to equally noisy relationships 
of different types. To do so, we must define two concepts, reliability and interpretability, which invoke
acceptance regions and interval estimates, respectively. For clarity of exposition, we avoid using the term 
“equitability” in the following, since we have already defined it previously. However, what we describe here 
as “worst-case interpretability” will turn out to be equivalent to equitability. 

We begin with the definition of reliability. 

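Definition 3.2 (Reshef et al. [2015b]). Let $\hat{\varphi} : \mathbb{R}^{2n} \to [0,1]$ be a statistic, and let $x, \alpha \in [0,1]$. The $\alpha$-reliable
interval of $\hat{\varphi}$ at x, denoted by $R_\alpha^{\hat{\varphi}}(x)$, is the smallest closed interval A with the property that, for all $Z \in Q$
with $\Phi(Z) = x$,

\[ P\big(\hat{\varphi}(D) < \min A\big) < \alpha/2 \quad \text{and} \quad P\big(\hat{\varphi}(D) > \max A\big) < \alpha/2 \]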

[1] We deviate here from Reshef et al. [2015b] in that we use the term “equitability” for arbitrary properties of interest $\Phi$,
rather than using “interpretability” in general and reserving “equitability” for cases in which $\Phi$ specifically reflects some notion
of relationship strength. We do this because in this paper $\Phi$ always reflects a notion of relationship strength. However, we note
that the concepts and tools here can be readily applied even if this is not the case.

[Figure 1, panels (a)-(c). Panels (a) and (b): $H_0 : \Phi = 0.3$ versus $H_1 : \Phi = x_1$ with $x_1 \in [0,1]$. Panel (c): $H_0 : \Phi = x_0$ versus $H_1 : \Phi = x_1$ with $x_0, x_1 \in [0,1]$.]

Figure 1: Equitability as a generalization of power against independence. (a) The power function of a size-$\alpha$
right-tailed test based on a statistic $\hat{\varphi}$ with null hypothesis $H_0 : \Phi = 0.3$. The curve shows the power of the test as
a function of $x_1$, the value of $\Phi$ in the alternative hypothesis. (b) The power function can be depicted instead as a
heat map. (c) Instead of considering just one null hypothesis/critical value, we can consider a set of null hypotheses
(with corresponding critical values) of the form $H_0 : \Phi = x_0$ and plot each of the resulting power curves as a heat
map. The result is a plot in which the intensity of the color in the coordinate $(x_1, x_0)$ corresponds to the power of a
size-$\alpha$ right-tailed test based on $\hat{\varphi}$ at distinguishing $H_1 : \Phi = x_1$ from $H_0 : \Phi = x_0$. A 1/d-equitable statistic is one
for which this power surface attains the value $1 - \alpha$ within distance d of the diagonal along each row.


where D is a sample of size n from Z.

The statistic $\hat{\varphi}$ is 1/d-reliable with respect to $\Phi$ on Q at x with probability $1 - \alpha$ if and only if the
diameter of $R_\alpha^{\hat{\varphi}}(x)$ is at most d.

The reliable interval at x is an acceptance region for a size-$\alpha$ test of the null hypothesis $H_0 : \Phi = x$. It
is the convex hull of central intervals of the sampling distributions of $\hat{\varphi}$ over all distributions $Z \in \Phi^{-1}(\{x\})$. If
there is only one Z such that $\Phi(Z) = x$, then the reliable interval is simply a central interval of the sampling
distribution of $\hat{\varphi}$ on Z.

Figures 2a and 2b show schematic illustrations of reliable intervals in the case where Q is a set of noisy
functional relationships, $\Phi = R^2$, and $\hat{\varphi}$ is the sample Pearson correlation coefficient. In Figure 2a, the
set Q contains only one relationship type: linear. Consequently, each possible value of $R^2$ has only one
distribution $Z \in Q$ with that $R^2$. In this case, the reliable interval at that $R^2$ value is simply a central
interval of the sampling distribution of the sample correlation. In Figure 2b, the set Q contains not one but
three relationship types: linear, exponential, and parabolic. This means that at every $R^2$ value there are
three different distributions $Z \in Q$ with that $R^2$, and consequently three different sampling distributions
of the sample correlation. In this setting, the reliable interval at that $R^2$ value is the smallest interval that
contains the union of the central intervals we constructed of those three sampling distributions.
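The construction just described can be sketched by simulation. The code below estimates, for a given $R^2$ value, the central interval of the sampling distribution of the sample Pearson correlation for each function type and then takes the smallest interval containing all of them. It is an illustrative sketch under stated assumptions: the particular functions, the uniform distribution of x, the sample size, the value of `alpha`, and the names `sample_correlations` and `reliable_interval` are all hypothetical, and the target `r2` must be strictly positive.

```python
import numpy as np

rng = np.random.default_rng(4)

FUNCTIONS = {
    "linear": lambda t: t,
    "exponential": lambda t: np.exp(3.0 * t),
    "parabolic": lambda t: (t - 0.5) ** 2,
}

def sample_correlations(f, r2, n, n_trials):
    """Sample Pearson correlations of (x, y) with y = f(x) + noise and R^2 close to r2."""
    x = rng.uniform(size=(n_trials, n))
    fx = f(x)
    noise_sd = np.sqrt(fx.var() * (1.0 - r2) / r2)
    y = fx + noise_sd * rng.normal(size=fx.shape)
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))

def reliable_interval(r2, names, alpha=0.1, n=200, n_trials=1000):
    """Smallest interval containing the central (1 - alpha) interval of every function type."""
    bounds = [
        np.quantile(
            sample_correlations(FUNCTIONS[name], r2, n, n_trials),
            [alpha / 2, 1 - alpha / 2],
        )
        for name in names
    ]
    return float(min(b[0] for b in bounds)), float(max(b[1] for b in bounds))

linear_only = reliable_interval(0.8, ["linear"])
all_three = reliable_interval(0.8, ["linear", "exponential", "parabolic"])
```

With only linear relationships in Q the interval is narrow, whereas adding the parabolic type, for which the sample correlation is near zero regardless of $R^2$, stretches the reliable interval across most of the range of the statistic.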

Having defined the reliable interval as an acceptance region, we can now define the interpretable interval
as an interval estimate of $\Phi$.

Definition 3.3 (Reshef et al. [2015b]). Let $\hat{\varphi} : \mathbb{R}^{2n} \to [0,1]$ be a statistic, and let $y, \alpha \in [0,1]$. The
$\alpha$-interpretable interval of $\hat{\varphi}$ at y, denoted by $I_\alpha^{\hat{\varphi}}(y)$, is the smallest closed interval containing the set

\[ \{ x \in [0,1] : y \in R_\alpha^{\hat{\varphi}}(x) \}. \]

The statistic $\hat{\varphi}$ is 1/d-interpretable with respect to $\Phi$ on Q at y with confidence $1 - \alpha$ if and only if the
diameter of $I_\alpha^{\hat{\varphi}}(y)$ is at most d.
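Given reliable intervals evaluated on a grid of x values, the interpretable interval at a score y can be read off by inverting them, as the following sketch shows. The reliable intervals used here are synthetic (made up for illustration, not estimated from data), and the function name `interpretable_interval` and the array-based representation are assumptions of the sketch.

```python
import numpy as np

def interpretable_interval(y, xs, lows, highs):
    """Smallest interval containing {x in xs : lows[x] <= y <= highs[x]}, or None if empty."""
    inside = (lows <= y) & (y <= highs)
    if not inside.any():
        return None
    return float(xs[inside].min()), float(xs[inside].max())

# Synthetic reliable intervals [x - 0.1, x + 0.2] on a grid of x values (hypothetical).
xs = np.linspace(0.0, 1.0, 101)
lows, highs = xs - 0.1, xs + 0.2

interval = interpretable_interval(0.505, xs, lows, highs)
diameter = interval[1] - interval[0]
```

Here a score of 0.505 is compatible with every x from 0.31 to 0.60 on the grid, so the interpretable interval has diameter 0.29 and the statistic would be roughly 1/0.29-interpretable at that score.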

Figure 2c shows schematic illustrations of two different interpretable intervals in the setting discussed
above, in which Q is a set of noisy functional relationships with three different function types (linear,
exponential, parabolic), $\Phi = R^2$, and $\hat{\varphi}$ is the sample Pearson correlation coefficient.

[Figure 2, panels (a)-(c); one axis of each panel is labeled $\Phi$ (e.g., $R^2$).]

Figure 2: A schematic illustration of interpretability/equitability with three relationship types: linear (blue),
exponential (red), and parabolic (yellow). Here the property of interest ($\Phi$) is $R^2$ and the statistic in question
($\hat{\varphi}$) is the sample Pearson correlation coefficient $\hat{\rho}$. (a) A plot of central intervals of the sampling distributions of
$\hat{\varphi} = \hat{\rho}$ against $R^2(Z)$ for $Z \in Q$, when Q consists only of linear relationships with varying amounts of added noise;
one reliable interval is pictured. Since there is exactly one relationship in Q corresponding to each $R^2$ value, the
reliable interval is simply a central interval of the relevant sampling distribution. (b) The analogous plot in the
case where Q contains noisy functional relationships ranging over three different functions: linear (blue), exponential
(red), and parabolic (yellow). Now the reliable interval is the smallest interval containing all three of the
relevant central intervals. (c) The same plot, with interpretable intervals pictured. The interpretable interval at each
value of $\hat{\rho}$ is composed of the $R^2$ values whose reliable intervals contain that value of $\hat{\rho}$. The shorter the interpretable
intervals, the more interpretable/equitable the statistic. The worst-case interpretable interval is denoted by a solid
red line; an additional interpretable interval is shown with a dashed red line. The thumbnails to the right of each
interval show representative relationships from the endpoints of that interval, both of which have the same $\hat{\rho}$ but
dramatically different values of $R^2$.


When we are discussing the interpretability or reliability of a statistic, we need to speak about more than 
one x or y value at a time. There are many potential ways to do this. Here we limit ourselves to two basic 
ones. 

Definition 3.4 (Reshef et al. [2015b]). A measure of dependence is worst-case $1/d$-reliable (resp. interpretable)
if it is $1/d$-reliable (resp. interpretable) at all $x$ (resp. $y$) $\in [0,1]$.

A measure of dependence is average-case $1/d$-reliable (resp. interpretable) if its reliability (resp.
interpretability), averaged over all $x$ (resp. $y$) $\in [0,1]$, is at least $1/d$.

Here and throughout, we use “worst-case” to refer to the worst-seen performance, as opposed to a proven 
bound, and we use “interpretability” with no qualifier to refer to worst-case interpretability. 

To gain some intuition for the definition of interpretability, let us consider what values $d$ can take. The
lowest possible interpretability happens when one of the interpretable intervals has size 1. In this case, the
(worst-case) interpretability of the statistic is 1 as well. In the best case, when all interpretable intervals
of a statistic are of size 0, the interpretability is $\infty$, and the statistic is said to be perfectly interpretable.
(As before, the perfect case is only expected to arise, if at all, as a property of the population value of the
statistic.)

To complete our example, let us find the worst-case interpretability of the sample correlation coefficient 
in the example of noisy functional relationships depicted in Figure 2c. To do this, we locate the widest 
interpretable interval in the figure; this happens to be the lower of the two intervals pictured. If the length 
of this interval is d, the sample Pearson correlation coefficient is worst-case 1/d-interpretable with respect 
to $R^2$ on our set $Q$. Thus, the shorter the interpretable intervals, the more interpretable the statistic.

3.3 The equivalence of the two formalizations 

It turns out that equitability and worst-case interpretability as defined above are equivalent under modest 
assumptions [Reshef et al., 2015b]. We state this result below. 

Theorem 3.5 (Reshef et al. [2015b]). Let $Q$ be a set of standard relationships, let $\Phi : Q \to [0,1]$, and let
$0 < \alpha < 1/2$. Let $\hat{\varphi}$ be a statistic with the property that $\max R_\alpha^{\hat{\varphi}}(x)$ is a
strictly increasing function of $x$. Then for all $d > 0$, the following are equivalent.

1. $\hat{\varphi}$ is $1/d$-equitable with respect to $\Phi$ with confidence $1 - 2\alpha$.

2. $\hat{\varphi}$ is worst-case $1/d$-interpretable with respect to $\Phi$ with confidence $1 - \alpha$.

This result can be interpreted in two ways. One interpretation is that a statistic that allows us to
approximately rank the relationships in a data set by strength as measured by $\Phi$ will also allow us, for any
$x_0$, to find all the relationships in the data set that have strength at least $x_0$ as measured by $\Phi$, and vice
versa. Another interpretation arises if $\Phi$ reflects relationship strength, in particular if $\Phi = 0$ corresponds to
the relationships in $Q$ exhibiting statistical independence. If this is the case, then the above theorem tells
us that equitability is a generalization of power against statistical independence on $Q$.

This is good news and bad news. On the one hand, it provides a link between equitability and power and 
clarifies the relationship between the two. On the other hand, it shows that equitability - by virtue of being 
stronger than power against independence - will also be more difficult to achieve, as it requires simultaneously 
attaining power against a much larger set of null hypotheses. This hints at a trade-off between equitability 
and power against independence for which we provide empirical evidence in Section 6. 

3.4 Equitability on functional relationships 

So far we have discussed equitability in general, conceptual terms, and it has many different concrete
interpretations depending on the choice of $\Phi$ and $Q$. We define here a concrete instantiation of equitability on
functional relationships that is used throughout this paper. To do this, we first must state what we mean
by “functional relationship”.

Definition 3.6 (Reshef et al. [2015b]). A random variable distributed over $\mathbb{R}^2$ is called a noisy functional
relationship if and only if it can be written in the form $(X + \varepsilon, f(X) + \varepsilon')$ where
$f : [0,1] \to \mathbb{R}$, $X$ is a random variable distributed over $[0,1]$, and $\varepsilon$ and $\varepsilon'$ are
(possibly trivial) random variables. We denote the set of all noisy functional relationships by $\mathcal{F}$.

Equitability on functional relationships in the sense of Reshef et al. [2011] and Reshef et al. [2015b] now
just amounts to the use of $R^2$ as the property of interest.

Definition 3.7 (Reshef et al. [2015b]). Let $Q \subset \mathcal{F}$ be a set of noisy functional relationships. A measure
of dependence is worst-case (resp. average-case) $1/d$-equitable on $Q$ if it is worst-case (resp. average-case)
$1/d$-equitable with respect to $R^2$ on $Q$.

In this paper we often abuse terminology by simply writing “equitability” to mean equitability with
respect to $R^2$ on various sets of functional relationships as defined above. Alternative definitions of this
concept with other sets $Q$ and functions $\Phi$ have been proposed. These are discussed in detail in Reshef et al.
[2015b].

3.5 Equitability: an example 

Using the framework reviewed here, Figure 3a demonstrates how one might analyze the equitability of a
statistic in practice from the standpoint of interpretable intervals. We take as an example the sample Pearson
correlation coefficient ($\hat{\rho}$). This statistic is not a measure of dependence in the sense that its population
value can be zero even in cases of non-trivial dependence. However, we analyze it here due to its widespread
familiarity and the intuitiveness of its scores.

In this example, as before, our property of interest will be $\Phi = R^2$. The set of standard relationships $Q$
will be a set of noisy functional relationships of the form $(X + \varepsilon, f(X) + \varepsilon'_\sigma)$ with
$\varepsilon = 0$, $\varepsilon'_\sigma \sim \mathcal{N}(0, \sigma^2)$, and $f$ ranging over the functions in Table A.1.

To analyze the equitability of $\hat{\rho}$, we generate, for 41 different noise levels $\sigma$ and for every function
$f$ in our set, 500 samples from the relationship $Z = (X, f(X) + \varepsilon'_\sigma)$ with a sample size of $n = 500$.
Using these, we estimate the 5th and 95th percentiles of the sampling distribution of $\hat{\rho}$ on $Z$. These allow us
to estimate the reliable interval at the value of $R^2$ corresponding to each noise level. The reliable intervals then
enable us to construct interpretable intervals, and our estimate of the equitability is then the reciprocal of the length
of the longest interpretable interval.

Figure 3: Examples of equitable and non-equitable behavior on a set of noisy functional relationships. (Reproduced
from Reshef et al. [2015b].) (a) The equitability with respect to $R^2$ of the sample Pearson correlation coefficient
$\hat{\rho}$ over the set $Q$ of relationships described in Section 3.5, with $n = 500$. Each shaded region is an estimated
90% central interval of the sampling distribution of $\hat{\rho}$ for a given relationship at a given noise level. The fact
that the interpretable intervals of $\hat{\rho}$ are large indicates that a given $\hat{\rho}$ value could correspond to
relationships with very different $R^2$ values. This is illustrated by the pairs of thumbnails corresponding to
relationships with the same $\hat{\rho}$ but different $R^2$ values. The largest interpretable interval is indicated by a
red line. Because it has width 1, the worst-case equitability with respect to $R^2$ in this case is 1, the lowest possible.
(b) An illustration of a hypothetical measure of dependence that achieves perfect equitability in the large-sample limit.
Here, the population quantity $\varphi$ depends only on the $R^2$ of the relationships and increases monotonically with
$R^2$. Thus, $\varphi$ can be used as a proxy for $R^2$ on $Q$ with no loss. Thumbnails are shown for sample relationships
that receive the same $\varphi$ score, which corresponds to the fact that they have equal $R^2$ scores.
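
As a rough illustration of the simulation procedure described in Section 3.5, the following Python sketch estimates reliable and interpretable intervals for the sample Pearson correlation. It is not the authors' code. The two functions, the even sampling of X on [0, 1], the grid of $R^2$ values, the smaller simulation sizes, and the convention $R^2 = \mathrm{Var}(f(X)) / (\mathrm{Var}(f(X)) + \sigma^2)$ for noise in the dependent variable only are all assumptions made here for illustration.

```python
import numpy as np

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])

def reliable_intervals(stat, functions, r2_grid, n, n_rep, rng, alpha=0.05):
    """Hull, over all functions, of the central (1 - 2*alpha) intervals at each R^2."""
    x = np.linspace(0.0, 1.0, n)  # assumption: X sampled evenly on [0, 1]
    lo = np.full(len(r2_grid), np.inf)
    hi = np.full(len(r2_grid), -np.inf)
    for f in functions:
        signal = f(x)
        for i, r2 in enumerate(r2_grid):
            # assumption: R^2 = Var(f(X)) / (Var(f(X)) + sigma^2), noise in Y only
            sigma = np.sqrt(signal.var() * (1.0 / r2 - 1.0))
            values = [stat(x, signal + rng.normal(0.0, sigma, n)) for _ in range(n_rep)]
            q_lo, q_hi = np.quantile(values, [alpha, 1.0 - alpha])
            lo[i] = min(lo[i], q_lo)
            hi[i] = max(hi[i], q_hi)
    return lo, hi

def interpretable_interval(y, r2_grid, lo, hi):
    """Smallest interval containing every grid value x whose reliable interval contains y."""
    inside = (lo <= y) & (y <= hi)
    if not inside.any():
        return None
    return float(r2_grid[inside].min()), float(r2_grid[inside].max())

def worst_case_width(y_grid, r2_grid, lo, hi):
    """Length of the longest interpretable interval over a grid of statistic values."""
    widths = []
    for y in y_grid:
        interval = interpretable_interval(y, r2_grid, lo, hi)
        if interval is not None:
            widths.append(interval[1] - interval[0])
    return max(widths)

rng = np.random.default_rng(0)
functions = [lambda t: t, lambda t: (t - 0.5) ** 2]  # linear and parabolic (illustrative)
r2_grid = np.linspace(0.05, 1.0, 20)                  # avoids R^2 = 0 (infinite noise)
lo, hi = reliable_intervals(pearson, functions, r2_grid, n=200, n_rep=100, rng=rng)
width = worst_case_width(np.linspace(0.0, 1.0, 101), r2_grid, lo, hi)
print(f"widest interpretable interval: {width:.2f}; equitability estimate: {1.0 / width:.2f}")
```

Because the parabola yields a correlation near zero at every noise level while the linear relationship does not, small values of the statistic are compatible with almost the whole grid of $R^2$ values, which is the behavior Figure 3a depicts.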

The fact that the interpretable intervals at many values of $\hat{\rho}$ are large indicates that a given value of
$\hat{\rho}$ could correspond to samples from relationships of different types that have very different $R^2$ values. This
is illustrated by the pairs of thumbnails corresponding to relationships that received the same $\hat{\rho}$ but have
different amounts of noise. This means that $\hat{\rho}$ is not very interpretable with respect to $R^2$ on this set $Q$
and is thus said to have poor equitability with respect to $R^2$ on $Q$. As a contrast, Figure 3b contains a
hypothetical illustration of the notion of perfect equitability, which would require that all the interpretable
intervals be of size 0.

Of course, equitability is a function not only of the method in question but also of the standard
relationships and the property of interest. For instance, while $\hat{\rho}$ has poor equitability with respect to $R^2$
on the $Q$ above, it is (trivially) asymptotically perfectly equitable with respect to the correlation on the set $Q$ of
bivariate normals.


4 Equitability analysis 

Having reviewed equitability and how to quantify it, we turn to evaluating the equitability of $\mathrm{MIC}_e$ and
several other leading measures of dependence. We begin by quantifying the equitability of each measure of
dependence using interpretable intervals. This is followed by an alternate visualization of the equitability of
each measure of dependence using conventional power analysis via the connection described in the previous
section.

4.1 Setting up the analysis

4.1.1 Choice of methods to analyze 

The set of existing measures of dependence is too large for us to analyze exhaustively, even in a paper that 
aims to be comprehensive. We therefore strive to include in our analysis a collection of methods that is 
representative of the broad approaches prevalent in the field today. 

Grid-based methods. The methods based on the maximal information coefficient and the total information
coefficient can be viewed as exploring the space of possible grids that can be drawn on the sampled
data, assigning a score to each grid via some metric, and then aggregating the scores. For MIC [Reshef
et al., 2011], the metric is a normalized mutual information score and the aggregation is a supremum.
$\mathrm{MIC}_e$ [Reshef et al., 2015a] is similar except it explores a more restricted set of grids. $\mathrm{TIC}_e$
[Reshef et al., 2015a] is like $\mathrm{MIC}_e$ except it aggregates by summation.
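
To make the "score each grid, then aggregate" idea concrete, here is a deliberately simplified Python sketch. It is not MIC, $\mathrm{MIC}_e$, or $\mathrm{TIC}_e$: those statistics optimize over where the grid lines are placed, whereas this sketch only scores equal-frequency grids. The function names and the `max_cells` limit on grid size are hypothetical and chosen here for illustration. Normalizing by the logarithm of the smaller grid dimension follows the MIC definition in Reshef et al. [2011].

```python
import numpy as np

def equipartition(values, n_bins):
    """Assign each point to one of n_bins equal-frequency bins (by rank)."""
    order = np.argsort(values, kind="stable")
    bins = np.empty(len(values), dtype=int)
    bins[order] = np.arange(len(values)) * n_bins // len(values)
    return bins

def grid_mutual_information(x, y, n_cols, n_rows):
    """Mutual information (bits) of the discrete distribution induced by the grid."""
    counts = np.zeros((n_cols, n_rows))
    np.add.at(counts, (equipartition(x, n_cols), equipartition(y, n_rows)), 1)
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)   # column marginal, shape (n_cols, 1)
    py = p.sum(axis=0, keepdims=True)   # row marginal, shape (1, n_rows)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())

def normalized_grid_score(x, y, n_cols, n_rows):
    """Grid mutual information divided by log2 of the smaller grid dimension."""
    return grid_mutual_information(x, y, n_cols, n_rows) / np.log2(min(n_cols, n_rows))

def score_grids(x, y, max_cells):
    """Return (max, sum) of normalized scores over grid sizes with at most max_cells cells.

    The maximum mimics supremum-style aggregation, the sum mimics summation-style
    aggregation. Requires max_cells >= 4 so that at least the 2-by-2 grid is scored.
    """
    scores = []
    for n_cols in range(2, max_cells // 2 + 1):
        for n_rows in range(2, max_cells // n_cols + 1):
            scores.append(normalized_grid_score(x, y, n_cols, n_rows))
    return max(scores), sum(scores)
```

For a noiseless increasing relationship the 2-by-2 equal-frequency grid already attains the maximum normalized score of 1, while for independent samples every score stays close to 0.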

We also include other recent grid-based methods introduced since the maximal information coefficient
[Reshef et al., 2011]. HHG [Heller et al., 2013] uses Pearson’s $\chi^2$ test statistic as its score, explores a set
of two-by-two grids defined by individual data points, and aggregates by summation. Though similar to
Hoeffding’s D [Hoeffding, 1948] in that it considers only two-by-two grids, it differs in the use of the $\chi^2$
test statistic. $S^{DDP}$ [Heller et al., 2014] explores a larger set of grids defined by subsets of the data points,
uses non-normalized mutual information as its score, and also aggregates by summation.[2] Another notable
grid-based method introduced recently is dynamic slicing [Jiang et al., 2014], which like MIC explores all
possible grids and aggregates by maximization, but uses as its score a version of mutual information that is
regularized according to a prior on the space of possible grids. We did not include dynamic slicing in our
comparison, however, because it is formulated only for performing a $k$-sample test whereas our focus here is
on measuring dependence between two continuous random variables.

Mutual information estimation. Since many of the grid-based methods we consider either use some
form of mutual information as their score or have variants that do, we also included a standard mutual
information estimator introduced by Kraskov [Kraskov et al., 2004]. This estimator was compared against
MIC in previous work [Reshef et al., 2011, 2013; Kinney and Atwal, 2014; Reshef et al., 2014], but those
comparisons were more limited in scope and did not include $\mathrm{MIC}_e$. (For convenience, in this work we
represent the estimated mutual information values in terms of the squared Linfoot correlation [Speed, 2011;
Linfoot, 1957], defined by $L^2(X,Y) = 1 - 2^{-2I(X,Y)}$, which takes values in $[0,1]$.)
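
The squared Linfoot correlation can be computed directly from a mutual information value, as in the short Python sketch below. The function names are illustrative, the mutual information is assumed to be measured in bits (matching the base-2 formula), and clamping slightly negative estimates to zero is an assumption made here so that the result stays in $[0,1]$.

```python
import math

def squared_linfoot(mi_bits):
    """Map a mutual information value in bits to L^2 = 1 - 2^(-2 I)."""
    mi_bits = max(mi_bits, 0.0)  # assumption: clamp small negative estimates to zero
    return 1.0 - 2.0 ** (-2.0 * mi_bits)

def gaussian_mi_bits(rho):
    """Mutual information (bits) of a bivariate normal with correlation rho, |rho| < 1."""
    return -0.5 * math.log2(1.0 - rho ** 2)

# For a bivariate normal, the squared Linfoot correlation recovers rho^2.
print(squared_linfoot(gaussian_mi_bits(0.6)))
```

This transformation is convenient because it places mutual information on the same $[0,1]$ scale as the other statistics, and for bivariate normals it coincides with the squared correlation.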

Distance/kernel-based statistics. We include distance correlation (dCor) [Szekely and Rizzo, 2009], an
analogue of the Pearson correlation coefficient that is defined using a different notion of covariance that
uses pairwise distances between points. In addition, we include the Hilbert-Schmidt Independence Criterion
(HSIC) [Gretton et al., 2005, 2008], a more general statistic defined on reproducing kernel Hilbert spaces of
which dCor is a special case [Sejdinovic et al., 2013].
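
For readers unfamiliar with distance correlation, the following Python sketch computes the standard sample version for two univariate samples from double-centered pairwise distance matrices. It is a minimal illustration rather than the implementation used in this paper; the function name is chosen here, and the quadratic memory cost makes it suitable only for modest sample sizes.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation of two 1-D samples of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute distances within each sample.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double centering: subtract row and column means, add back the grand mean.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = (A * B).mean()    # squared sample distance covariance
    dvar_x = (A * A).mean()   # squared sample distance variance of x
    dvar_y = (B * B).mean()   # squared sample distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))
```

On a noiseless linear relationship this returns 1, and on a noiseless symmetric parabola it remains well above zero even though the Pearson correlation is approximately zero.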

Correlation-based methods. As an intuitive benchmark for the reader, we include the Pearson correlation
coefficient ($\hat{\rho}$). However, there are many successful tools that use $\rho$ after computing a non-linear
transformation of the data. We include perhaps the best-known one, maximal correlation [Renyi, 1959], which given
random variables $X$ and $Y$ searches for arbitrary measurable functions $f$ and $g$ such that $\rho(f(X), g(Y))$ is
maximized. There is no known algorithm for finding the optimal $f$ and $g$ in general, but the (approximate)
method of alternating conditional expectations [Breiman and Friedman, 1985] is widely used and we use it
here as well. We also include a more recent related method, the randomized dependence coefficient
[Lopez-Paz et al., 2013], which applies many random transformations to $X$ and $Y$ and then searches for the linear
combinations of the transformed features that maximize the correlation.
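
The alternating idea behind the method of alternating conditional expectations can be sketched in a few lines of Python. This is a toy version, not the procedure of Breiman and Friedman [1985] nor the implementation used in this paper: conditional expectations are estimated with a crude equal-frequency bin-mean smoother, and the names `bin_means`, `ace_correlation`, `n_bins`, and `n_iter` are hypothetical choices made here.

```python
import numpy as np

def bin_means(target, by, n_bins=20):
    """Estimate E[target | by] using means within equal-frequency bins of `by`."""
    order = np.argsort(by, kind="stable")
    bins = np.empty(len(by), dtype=int)
    bins[order] = np.arange(len(by)) * n_bins // len(by)
    sums = np.bincount(bins, weights=target, minlength=n_bins)
    counts = np.bincount(bins, minlength=n_bins)
    return (sums / counts)[bins]

def standardize(v):
    """Center to mean zero and scale to unit variance."""
    return (v - v.mean()) / v.std()

def ace_correlation(x, y, n_bins=20, n_iter=20):
    """Alternate f(X) <- E[g(Y) | X] and g(Y) <- E[f(X) | Y], then correlate f and g.

    Requires len(x) >= n_bins so that every bin is non-empty.
    """
    g = standardize(y)
    for _ in range(n_iter):
        f = standardize(bin_means(g, x, n_bins))
        g = standardize(bin_means(f, y, n_bins))
    return float(np.corrcoef(f, g)[0, 1])
```

With a binned smoother the estimate is biased upward in small samples, so values from this sketch should be read only as a qualitative illustration of how transformations of both variables can recover a dependence that the plain correlation misses.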

[2] There are other variations on these statistics presented in Heller et al. [2013, 2014]. However, we omit those
results as they were generally similar to or worse than the ones we display.

4.1.2 Choice of $Q$, $\Phi$, and sample sizes


In an ideal world, when assessing equitability in a specific instance, we would know the true underlying model 
Q governing the relationships in our data set. Knowledge of Q would, for example, include information about 
the types of relationships present and the noise distribution (e.g., Gaussian, zero-mean, heteroscedastic, etc.). 
Of course, in reality we generally do not have this information and, to make matters worse, the results of 
an equitability analysis may depend strongly on the choice of Q. Thus, in evaluating the equitability of 
measures of dependence, it is important to aim for robustness: we would like to have a measure of dependence 
with good equitability over as many different relationship types as possible. 

However, there is a central tension between the need to use as large a set $Q$ as possible in order to assess
robustness and the need to use a $Q$ that is sufficiently small that a reasonable property of interest $\Phi$ can
be defined for the relationships in $Q$. To take an extreme example, setting $Q$ to be the set of all bivariate
relationships would certainly ensure that we do not leave any stone unturned, but at the same time it begs
the original question of how one can measure relationship strength in such a general context.

For this reason, following Reshef et al. [2011], we choose to focus on noisy functional relationships since
these represent a broad, easily definable class of relationships commonly found in practical applications that
comes with an intuitive and natural measure of relationship strength: $R^2$, the coefficient of determination
with respect to the generating function. To ensure robustness, we vary the relationships tested along as
many dimensions as possible including relationship type, the type of noise added, marginal distributions,
and sample size.

We note here that our goal in this analysis is not to establish the equitability of any method across the 
entire set of noisy functional relationships. In fact, under some of the sampling/noise models we considered, 
there are functions whose inclusion leads to poor equitability across all methods. We therefore attempted 
to characterize as broad a set of functions as possible that still allowed for non-trivial equitability. 

To that end, our analyses include some 16-21 different functional relationships (depending on noise model;
see Appendix A.1), each with increasing levels of additive Gaussian noise, considered under twelve different
sampling/noise models, at four sample size regimes ($n = 250$, $500$, $5000$, and the infinite data limit). Each
of the 12 sampling/noise models $Q$ is defined using a combination of an independent variable marginal
distribution from the set

    $E_{f(X)}$: points sampled evenly along the curve described by $f(X)$
    $E_X$: points sampled evenly along the $X$ range
    $U_{f(X)}$: points sampled uniformly along the curve described by $f(X)$
    $U_X$: points sampled uniformly along the $X$ range

and a noise distribution from the set

    $N_Y$: normally distributed noise added to the dependent variable
    $N_X, N_Y$: normally distributed noise added to both variables
    $N_X$: normally distributed noise added to the independent variable

We refer to these noise models using abbreviations of the form $E_{f(X)}[N_Y]$, which would correspond to a
model in which the independent variable is sampled evenly along the curve described by $f(X)$ and Gaussian
noise is added only to the dependent coordinate. Appendix A.1 contains definitions of the functions used.
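
As a hedged illustration of how one such sampling/noise model might be simulated, the Python sketch below draws a sample with a target $R^2$ when noise is added only to the dependent variable. Several things here are assumptions rather than details from the paper (the actual data generation is described in Appendix A.2): "evenly" is read as equally spaced points and "uniformly" as i.i.d. uniform draws, only the two $X$-range marginals are covered, the sample variance of $f(X)$ stands in for its population variance, and $R^2$ is taken to be $\mathrm{Var}(f(X)) / (\mathrm{Var}(f(X)) + \sigma^2)$.

```python
import numpy as np

def sample_relationship(f, r2, n, rng, marginal="E_X"):
    """Draw n points (x, f(x) + noise) whose population R^2 is approximately r2.

    marginal: "E_X" for equally spaced x on [0, 1], "U_X" for i.i.d. uniform x.
    Only noise in the dependent variable (the N_Y case) is handled.
    """
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    if marginal == "E_X":
        x = np.linspace(0.0, 1.0, n)
    elif marginal == "U_X":
        x = rng.uniform(0.0, 1.0, n)
    else:
        raise ValueError("this sketch only covers the E_X and U_X marginals")
    signal = f(x)
    # Solve r2 = Var(signal) / (Var(signal) + sigma^2) for sigma.
    sigma = np.sqrt(signal.var() * (1.0 / r2 - 1.0))
    y = signal + rng.normal(0.0, sigma, n)
    return x, y
```

Sampling along the curve (the $E_{f(X)}$ and $U_{f(X)}$ marginals) would additionally require an arc-length parametrization of the graph of $f$, which is omitted here.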


4.1.3 Parameters of the analysis 

For each $Q$, for each sample size $n$, we examine 41 different $R^2$ values evenly spaced in the unit interval. At
each of these $R^2$ values, we generate 500 independent realizations of a sample of size $n$ from each relationship
in $Q$ with the given $R^2$ value. These are used to estimate sampling distributions for $\hat{\varphi}$. (See
Appendix A.2 for details regarding data generation.)

4.1.4 Parameters of statistics tested 

Several of the methods tested are parametrized, including $\mathrm{MIC}_e$, HSIC, the Kraskov mutual information
estimator, RDC, and $S^{DDP}$. For each of these methods, we performed a parameter sweep to assess the effect
of parameter settings on equitability. In most cases, we found that parameter settings did not significantly
affect equitability and so we present here results obtained with default parameters. For $\mathrm{MIC}_e$ and the
Kraskov mutual information estimator, however, parameter settings did affect equitability. Therefore, for
these methods, we present for each sample size the best results across parameter values tested. Results for
all parameter values tested can be found in the online supplement at
http://www.exploredata.net/ftp/empirical_supplement.zip. In Section 6.2 we discuss guidelines for how to set
parameters for $\mathrm{MIC}_e$ more generally.

4.1.5 Quantification of equitability 

The equitability of each measure of dependence is quantified using interpretable intervals, as discussed in
Section 3.5. In the equitability plots presented, shaded regions denote central intervals containing 90%
probability mass of the sampling distribution of each measure of dependence at each $R^2$ value; these reliable
intervals correspond to 0.05-interpretable intervals. In general, we report both average-case and worst-case
equitability in our analyses, and the interval plotted in red on each plot represents the worst-case
0.05-interpretable interval for that plot. (The shorter the interval, the more equitable the statistic.)

4.2 Results 

Figures 4 and B.1 demonstrate the equitability of $\mathrm{MIC}_e$, distance correlation, maximal correlation, HSIC,
the Kraskov mutual information estimator, RDC, and $S^{DDP}$ for noise models $E_{f(X)}[N_X, N_Y]$ and
$E_{f(X)}[N_Y]$ at a range of sample sizes. Results for all other noise models are presented in the supplemental
materials, along with results for $\mathrm{TIC}_e$, HHG, and $\hat{\rho}$. Tables B.1 and B.2 summarize the worst-case
and average-case equitability, respectively, for all measures of dependence across all models and sample sizes, as
measured by 0.05-interpretable intervals.

We offer here some discussion of the salient questions answered by these analyses. 

4.2.1 Comparing the equitability of $\mathrm{MIC}_e$ and mutual information

Given the connections between $\mathrm{MIC}_e$ and mutual information, which are discussed in depth in Reshef et al.
[2015a], it is natural to ask whether the direct estimation of mutual information achieves a similar level of
equitability to that of $\mathrm{MIC}_e$. In general, among the variety of models and sample sizes tested, the answer
appears to be ‘no’, but we present a more detailed breakdown of the results below.

Effect of model choice on equitability. Figure 5, as well as Tables B.1 and B.2, demonstrate the relative
robustness of the equitability of $\mathrm{MIC}_e$ to the choice of model $Q$ compared to that of the Kraskov mutual
information estimator. At each sample size, the equitability of $\mathrm{MIC}_e$ is fairly stable with respect to the
variations in noise models and independent variable marginal distributions tested. On the other hand, while
mutual information estimation sometimes has good equitability, it more often has poor equitability under
the models tested. More specifically, mutual information estimation can be equitable in models that only
contain noise added in the dependent coordinate, while $\mathrm{MIC}_e$ performs equitably even outside this domain,
such as in the case of models that include noise added to either or both the dependent and independent
coordinates. The performance of mutual information estimation is also improved when the independent
variable is stochastic rather than fixed, though this distinction never affects whether it outperforms
$\mathrm{MIC}_e$ or not.

Effect of sample size on equitability. Estimating mutual information from finite samples is a challenging
problem that has inspired many non-trivial methods [Paninski, 2003; Moon et al., 1995; Kraskov et al., 2004],
and Tables B.1 and B.2, as well as Figures 4, B.1, and 5, demonstrate the strong influence of finite-sample
effects on the equitability of mutual information estimation. Consistent with the fact that $\mathrm{MIC}_*$ is
uniformly continuous while mutual information is not [Reshef et al., 2015a], estimation of $\mathrm{MIC}_*$ suffers
less from this problem: for $n = 250$ and $n = 500$, $\mathrm{MIC}_e$ has both superior worst-case and average-case
equitability over mutual information estimation (using $k = 1$, $6$, $10$, and $20$ in the Kraskov estimator) in
every model $Q$ tested, and in most cases by substantial margins. For $n = 5000$, mutual information estimation has
better equitability than $\mathrm{MIC}_e$ in settings where there is only noise in the dependent variable, while
$\mathrm{MIC}_e$ has superior equitability in all other models tested. Aspects of this phenomenon have previously
been noted in Reshef et al. [2013], and subsequently in Kinney and Atwal [2014] and Reshef et al. [2014].

Figure 4: The equitability of measures of dependence on a set $Q$ of noisy functional relationships. [Narrower is more
equitable.] The relationships take the form $(X + \varepsilon, f(X) + \varepsilon')$ where $\varepsilon$ and
$\varepsilon'$ are i.i.d. normals of varying amplitude, and relationship strength is quantified by $\Phi = R^2$. The
plots were constructed as described in Figure 2. In each plot, the worst-case interpretable interval is indicated by a
red line, and both the worst- and average-case equitability are listed. The fact that the worst-case interpretable
intervals of $\mathrm{MIC}_e$ are small indicates that a given $\mathrm{MIC}_e$ score reflects the coefficient of
determination ($R^2$) with respect to the generating function $f$ with a relatively weak dependence on the function $f$
in question. That is, $\mathrm{MIC}_e$ has high equitability with respect to $\Phi = R^2$ for this choice of $Q$. Mutual
information, estimated using the Kraskov estimator, is represented using the squared Linfoot correlation. For every
parametrized statistic whose parameter meaningfully affects equitability, results are presented at each sample size
using parameter settings that maximize equitability across all twelve of the noise/marginal distributions tested at
that sample size.

Figure 5: A comparison of the equitability of $\mathrm{MIC}_e$ and mutual information estimation under three noise models
including the one in Figure 4. [Narrower is more equitable.] Plots were constructed as in Figure 2. In each plot,
the worst-case interpretable interval is indicated by a red line, and both the worst- and average-case equitability are
listed. As in Figure 4, results for both statistics are presented for each sample size using parameter settings that
maximize equitability across all twelve of the noise/marginal distributions tested at that sample size.
Mutual information, estimated using the Kraskov estimator, is represented using the squared Linfoot correlation.
While mutual information estimation using the Kraskov estimator is equitable at high sample size on some of the
sets $Q$ that were tested, the equitability of $\mathrm{MIC}_e$ is more robust to noise model, independent variable
marginal distribution, and limited sample size. For versions of this analysis using additional independent variable
marginal distributions, see the supplemental materials.

Equitability in the large-sample limit. Departures from perfect equitability can occur either as a result
of finite sample effects, or because of the lack of interpretability of the population value of the statistic.
To disentangle these two potential effects, we compare the equitability of $\mathrm{MIC}_*$ and the Kraskov mutual
information estimator in the large-sample limit (Figure B.2). This analysis yields two important insights.
First, it demonstrates that when finite sample effects are minimal, $\mathrm{MIC}_*$ has both superior worst-case and
average-case equitability in the four models $Q$ that contain noise added in the independent variable or in
both the independent and dependent variables, while mutual information is more equitable than $\mathrm{MIC}_*$ in
the two remaining settings, where noise is added only in the dependent variable. Second, more generally, it
shows that neither $\mathrm{MIC}_*$ nor mutual information is worst-case perfectly interpretable with respect to
$\Phi = R^2$ over the sets $Q$ examined. This is not surprising given the broad range of relationships, noise models,
and independent variable marginal distributions tested.

Relationship to equitability analysis from Kinney and Atwal [2014]. A more limited analysis of the
equitability of MIC and mutual information estimation was presented in Kinney and Atwal [2014]. There, the
authors examined the equitability of MIC and mutual information estimation specifically at a large sample
size ($n = 5000$) and under one choice of $Q$ ($E_{f(X)}[N_Y]$). From this, they concluded that mutual information
estimation was more equitable than MIC. As our analysis here shows, though that is true for this specific
choice of $Q$ and sample size, it is not true in general. To the contrary, the general picture seems to be that
the equitability of estimators of $\mathrm{MIC}_*$ is more robust than that of estimators of mutual information due to
a combination of finite-sample effects and differences between the population values themselves.

For more on this discussion, see the technical comment [Reshef et al., 2014] published by the authors of
this paper about Kinney and Atwal [2014]. For a discussion of the theoretical results of Kinney and Atwal
[2014], see Reshef et al. [2015b] and Murrell et al. [2014].

4.2.2 The equitability of $\hat{\rho}$, dCor, maximal correlation, HSIC, RDC, $\mathrm{TIC}_e$, HHG, and $S^{DDP}$

Figures 4 and B.1, as well as Tables B.1 and B.2, demonstrate that $\hat{\rho}$, distance correlation, maximal
correlation, HSIC, RDC, $\mathrm{TIC}_e$, HHG, and $S^{DDP}$ all display relatively poor equitability over the models
$Q$ tested. (We note that these methods were not designed with equitability in mind and so do not make claims about
equitability.) Of these methods, maximal correlation displays the highest degree of equitability. Additionally,
the equitability profiles of both dCor and RDC are similar to that of the correlation $\hat{\rho}$.

4.3 Alternate equitability analysis via connection with statistical power 

Figures 6 and B.3 quantify the equitability of the set of measures of dependence examined above via a power 
analysis. This is achieved as demonstrated in Figure 1. Analyses are presented for the same range of models 
and sample sizes examined in the equitability analysis performed using interpretable intervals, and results 
for all other models are presented in the supplemental materials. 

Assessing equitability using statistical power analysis confirms the conclusions that are reached by the
quantification of equitability using interpretable intervals above. That is, in this analysis, $\mathrm{MIC}_e$ is the
only measure of dependence that is able to distinguish any null hypothesis of the form $H_0 : R^2 = x_0$ from any
alternative hypothesis of the form $H_1 : R^2 = x_1$ with high power across the full range of models $Q$ and sample
sizes examined, even when $x_1 - x_0$ is relatively small. As in the equitability analysis using interpretable
intervals, the Kraskov mutual information estimator is not able to achieve this task at the tested sample sizes
below $n = 5000$, and even at $n = 5000$ it is only able to do so for models that contain noise only in
the dependent variable. This is true regardless of the choice of parameter used in the Kraskov estimator.
(See Appendix B for results achieved using additional parameters.) Finally, as before we see that $\hat{\rho}$,
distance correlation, HSIC, RDC, HHG, and $S^{DDP}$ are highly non-equitable, with maximal correlation being the
only other measure of dependence tested that displayed any degree of equitability.

In this analysis, in which $Q$ is a set of noisy functional relationships and the property of interest is
$R^2$, methods such as distance correlation and HSIC, which are traditionally considered to be well powered
for detecting deviations from independence, do not yield tests that achieve high power, even in the case
where the null hypothesis is statistical independence. This is due to the fact that even when we consider a
null of independence, we have a composite alternative hypothesis due to the multiple different functional
forms present in $Q$. This requires methods to yield tests that are highly powered at simultaneously detecting
deviations from independence in all of the relationship types present in $Q$. The poor power displayed by tests
based on distance correlation, HSIC, and RDC is due to the fact that, while they may be highly powered at
detecting deviations from independence in, say, linear relationships, they are worse at simultaneously doing
so for the more nonlinear relationships. Of course, when both the null and alternative hypotheses are allowed
to take on non-zero values of $R^2$, the task of differentiating between the null and alternative becomes even
harder as both the null and alternative are now composite, and correspondingly the performance of these
methods suffers further.

4.4 Discussion 

In this section we analyzed the equitability with respect to R 2 of MIC e alongside several leading measures 
of dependence, on many different sets of relationships with varying sample sizes, noise types, and marginal 
distributions. Our main finding is that in most (32 out of 36) of the settings we considered, MIC e is 
substantially more equitable than the other methods. In the remaining four settings, all of which had a 
sample size of n = 5,000 and no noise added in the independent variable, mutual information estimation 
using the Kraskov estimator outperformed MIC e by a small margin; however, the equitability of the Kraskov 
estimator at lower sample sizes or on other noise models is otherwise poor. 



Figure 6: The equitability of measures of dependence on noisy functional relationships, visualized in terms of 
power. [Redder is more equitable.] The set of noisy functional relationships analyzed is the same as in Figure 4, 
and relationship strength is again quantified by Φ = R 2 . Plots were generated as in Figure 1. The intensity of 
the pixel at coordinate (x_1, x_0) in each heat map shows the power of a right-tailed test based on the statistic 
in question at distinguishing the (composite) alternative hypothesis H_1 : R 2 = x_1 from the (composite) null 
hypothesis H_0 : Φ = x_0 with type I error at most α = 0.05. An optimal statistic would yield tests with 100% 
power for every x_1 > x_0. MIC e comes closest to achieving this ideal, and performs particularly well relative 
to other methods at lower sample sizes. For each plot, the average area under the power curve across the entire 
set of null hypotheses is listed. (The maximum achievable such area is 0.5.) Mutual information, estimated using 
the Kraskov estimator, is represented using the squared Linfoot correlation. For every parametrized statistic 
whose parameter meaningfully affects equitability, results are presented at each sample size using parameter 
settings that maximize equitability across all twelve of the noise/marginal distributions tested at that sample 
size. 

As we show later, the equitability of MIC e does seem to come at a price. Specifically, though MIC e 
does, with certain parameter settings, yield tests with good power against independence (see Section 5), 
the settings that confer the equitability demonstrated above do not have this property. This suggests that 
there is an inherent trade-off in the statistic between power against independence and equitability, and in 
Section 6 we establish that this is indeed the case. 

Interestingly, besides MIC e and the Kraskov estimator, the other method with non-trivial equitability 
with respect to R 2 in our experiments is maximal correlation as computed using the method of alternating 
conditional expectations (ACE). This is interesting because, on the one hand, one can show from its definition 
that the squared maximal correlation is bounded from below by R 2 , and on the other hand the lack of 
equitability of maximal correlation in our experiments seems to stem from the ACE method returning 
results below this lower bound. We therefore wonder whether maximal correlation — were it computable 
exactly — would be highly equitable with respect to R 2 . 

The analyses presented in this section demonstrate that equitability with respect to R 2 is achievable to 
a significant extent, at least on the relationships tested here. However, while the noise models, marginal 
distributions, and functions used were chosen to be representative of real-world relationships, they by no 
means form a large enough set to allow us to make claims about the performance of these methods in general. 
Given this state of affairs, a better theoretical understanding of MIC e and also of equitability — with respect 
to R 2 and otherwise — is crucial for allowing us to determine when and to what extent equitability can be 
achieved. Though this is an ambitious goal, we feel it is important for guiding the development of methods 
for coping with the growing complexity of today’s data sets. It is our hope that the empirical insights 
presented here, together with the theory presented in Reshef et al. [2015b, a], will inform and enable further 
investigation of both equitability and MIC e . 

5 Statistical power analysis 

There are many settings that call simply for testing for any deviation from independence rather than 
relationship ranking, or in which relationship ranking is simply not feasible. These settings require a measure 
of dependence that yields tests with high power against a null hypothesis of statistical independence. 

Here, we turn to assessing the power against independence of the set of measures of dependence examined. 
This has been done previously, most notably by Simon and Tibshirani [Simon and Tibshirani, 2012]. Our 
analysis expands upon the power analysis performed by Simon and Tibshirani in three key ways. First, we 
examine power not as a function of absolute amount of noise in the alternative hypothesis but rather as a 
function of the R 2 of the alternative hypothesis, allowing us to aggregate across relationship types to gain a 
more global view of the power of each method. Second, for each of the statistics we analyze that has a free 
parameter, we perform a parameter sweep to understand the power of the corresponding tests as a function 
of that parameter, and to determine what the optimal value of the parameter is. Last, we analyze a larger 
set of methods, with a greater variety of sample sizes. The result is an in-depth portrait of statistical power, 
assembled using the best achievable performance of a large number of leading methods. 

5.1 Setting up the analysis 

5.1.1 Choice of methods to analyze 

The methods analyzed were the same as those examined in the equitability analysis. See Section 4.1 for 
more details. 

5.1.2 Choice of relationships and sample size 

For all of the power analyses performed, we use both the set of relationships and noise model (U_X[N_Y]) 
chosen by Simon and Tibshirani [Simon and Tibshirani, 2012]. For consistency with the sample sizes used 
throughout this work, we show results for n = 500, but results for all analyses using n = 100 are similar and 
are provided in the supplemental materials. 

5.1.3 Parameters of the analysis 

In order to make power results for different relationships comparable, we sought to compute power as a 
function of R 2 , in a manner similar to the equitability analyses above, rather than as a function of absolute 
magnitude of added noise. To do this, we determined, for each of the eight relationship types chosen by 
Simon and Tibshirani 3 , 100 noise levels evenly distributed over the range of noise levels yielding R 2 = 1.0 (no 
noise) and R 2 = 10^{-2.5} (substantial noise). (See Appendix A.2.) We then drew 1000 independent samples, 
each of size n = 500, from the corresponding distribution. This was our alternative hypothesis. We also 
drew 1000 independent samples from a corresponding null hypothesis chosen to have the same marginals. 
All analyses were performed at a significance level of 0.05. 
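
The Monte Carlo procedure described above can be sketched as follows. This is a minimal illustration, not the code used for the analyses in this paper. The function names, the use of absolute Pearson correlation as a stand-in statistic, the noisy linear relationship, and the construction of the null by permuting y (which preserves both marginals) are all assumptions made for the example.

```python
import numpy as np


def abs_pearson(x, y):
    """Stand-in test statistic (an assumption): absolute Pearson correlation."""
    return abs(float(np.corrcoef(x, y)[0, 1]))


def noisy_line(noise_sd):
    """Return a sampler for y = x + Gaussian noise, with x uniform on [0, 1]."""
    def sample(rng, n):
        x = rng.uniform(0.0, 1.0, n)
        return x, x + rng.normal(0.0, noise_sd, n)
    return sample


def estimate_power(sample_alt, statistic, n=500, n_trials=1000, level=0.05, seed=0):
    """Estimate the power of a right-tailed independence test by simulation."""
    rng = np.random.default_rng(seed)
    alt_scores = np.empty(n_trials)
    null_scores = np.empty(n_trials)
    for t in range(n_trials):
        # Score a sample drawn from the alternative hypothesis.
        x, y = sample_alt(rng, n)
        alt_scores[t] = statistic(x, y)
        # Score a sample from a null with the same marginals: shuffling y
        # breaks the dependence but leaves each marginal unchanged.
        x0, y0 = sample_alt(rng, n)
        null_scores[t] = statistic(x0, rng.permutation(y0))
    # Reject when the statistic exceeds the (1 - level) quantile of the null scores.
    critical_value = np.quantile(null_scores, 1.0 - level)
    return float(np.mean(alt_scores > critical_value))
```

Calling `estimate_power(noisy_line(0.1), abs_pearson)` gives one point on a power curve; repeating the call over a grid of noise levels, each mapped to its R 2 , gives the full curve for that relationship type.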

5.1.4 Parameters of statistics tested 

To understand how choice of parameter affects statistical power in the case of each measure of dependence, 
we performed a parameter sweep for each method that has a parameter. 4 To do this, we needed a way of 
quantitatively summarizing power across eight relationship types, so that we could then graph performance 
as a function of parameter value and then choose the optimal parameter value. We did this in two different 
ways. For both ways, having power computed as a function of R 2 , so that power on different relationships 
could be directly compared, was crucial. 

The first way that we summarized power was by computing the area under the power curve for each 
relationship type, integrating with respect to absolute noise level. That is, we computed the power curve 
for a given relationship type (e.g., linear) as a function of amount of noise added, and then computed the 
area under that curve up to a pre-specified limit on the amount of noise (as measured by R 2 ). The resulting 
number measures the expected power of tests based on the statistic in question when the amount of noise 
added in the alternative hypothesis is chosen uniformly at random. 
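
A minimal sketch of this first summary is given below. It is illustrative only: the function name and the trapezoid-rule integration are assumptions, and the cutoff is expressed directly as a noise level, whereas the text specifies it through the corresponding R 2 . With `normalize=True` the area is divided by the width of the noise range, which gives the mean power under a uniformly chosen noise level.

```python
import numpy as np


def area_under_power_curve(noise_levels, power, max_noise=None, normalize=False):
    """Trapezoid-rule area under a power curve, integrating over noise level."""
    noise = np.asarray(noise_levels, dtype=float)
    pw = np.asarray(power, dtype=float)
    # Sort by noise level so that the integration runs left to right.
    order = np.argsort(noise)
    noise, pw = noise[order], pw[order]
    if max_noise is not None:
        # Keep only the part of the curve up to the pre-specified noise limit.
        keep = noise <= max_noise
        noise, pw = noise[keep], pw[keep]
    area = float(np.sum(np.diff(noise) * 0.5 * (pw[1:] + pw[:-1])))
    if normalize:
        # Dividing by the width of the range gives the average power.
        area /= float(noise[-1] - noise[0])
    return area
```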

The second way that we summarized power was by computing the minimum alternative hypothesis R 2 
necessary to achieve a certain level of power [Kinney and Atwal, 2014]. Another way of thinking of this is 
“what is the maximum amount of noise that can be added to a relationship before power for differentiating 
that relationship from independence drops below a pre-set threshold?” The results presented here use a 
threshold of 50% power; results for other thresholds (95%, 75%, 25%, and 10%) are similar and can be found 
in the online supplement. 
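
The second summary can be sketched in the same spirit. The function below is a hypothetical helper, not the implementation used here. It assumes power has been estimated on a finite grid of R 2 values and, because simulated power curves need not be perfectly monotone, it returns the smallest grid value at or above which power never falls below the target. That tie-breaking rule is an assumption of the sketch.

```python
import numpy as np


def minimal_r2_for_power(r2_values, power, target=0.5):
    """Smallest grid R^2 at or above which power stays >= target (None if none)."""
    r2 = np.asarray(r2_values, dtype=float)
    pw = np.asarray(power, dtype=float)
    order = np.argsort(r2)
    r2, pw = r2[order], pw[order]
    best = None
    # Walk from the strongest relationship downward and stop at the first
    # grid point whose power drops below the target.
    for strength, p in zip(r2[::-1], pw[::-1]):
        if p < target:
            break
        best = float(strength)
    return best
```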

5.2 Results 

Figure 7 contains quantitative rankings of the measures of dependence by the power of their corresponding 
tests for independence, using optimal parameter values determined by each of the two methods described 
above. The parameter sweeps themselves, which characterize power against independence as a function of 
statistic parameters, are presented in Figures C.1 and C.2. 

This analysis yields several insights, which we discuss below. 

5.2.1 Average power across relationship types 

Let us first use the average power across relationship types to rank the measures of dependence from most to 
least powerful over this set of relationships. Doing so using the quantification of power in Figure 7a 5 yields 

3 Note that one of the relationship types chosen by Simon and Tibshirani was a circle. Since this relationship is not a noisy 
functional relationship, one cannot truly discuss its R 2 . Therefore, as a heuristic workaround, we defined the R 2 of a noisy 
circle to be the average of the R 2 values, computed separately, of the top and bottom halves. 

4 Some methods, such as RDC, will in the future automatically select optimal parameters in a relationship-type-dependent 
way [Lopez-Paz, 2015]. 

5 Though this quantification of power computes the area under the power curve integrating with respect to absolute noise 
level, one could integrate with respect to R 2 instead. Doing so would measure the expected power of each statistic on an 
alternative hypothesis with a randomly chosen R 2 . When this is done and optimal parameters are chosen for each method, the 
resulting ranking is 

TIC e > MIC e > S DDP > MIC > HHG > max. corr. > I > MIC orig. param > RDC > dCor > HSIC > ρ 

This ranking makes sense because integrating with respect to R 2 rather than absolute noise level emphasizes performance on 
stronger relationships, which is more similar to the type of performance quantified by equitability. Correspondingly, the optimal 

Figure 7: Measures of dependence ranked by the power of their corresponding independence tests. For each measure 
of dependence and each relationship type, power was quantified using (a) the area under the power curve [higher 
is more powerful], or (b) the minimal R 2 at which at least 50% power is achieved [lower is more powerful]. The 
collection of these scores across relationship types is then plotted for each method along with quartiles, and both 
average- and worst-case performance across relationship types are listed. Optimal parameter values for each test 
statistic were chosen to maximize average-case performance; see (a) Figure C.1, or (b) Figure C.2. The MIC statistic 
from Reshef et al. [2011] with the parameters used in Simon and Tibshirani [2012] is labeled in red; there is a 
substantial improvement in power when an optimal parameter is chosen. A further improvement in power is attained 
by MIC e , and the performance of TIC e is state-of-the-art. The sample size was n = 500; results are similar with 
n = 100 and, for (b), with power thresholds besides 50%. (See supplementary materials.) 

(from most to least powerful): 

S DDP > TIC e > dCor > HHG > max. corr. > HSIC > MIC e > RDC > ρ > MIC > I > MIC orig. param 

Doing the same using the quantification of power in Figure 7b yields (from most to least powerful): 

TIC e > MIC e > S DDP > MIC > HHG > max. corr. > I > MIC orig. param > RDC > dCor > HSIC > ρ 

When the largest outlier, a high-frequency sinusoid, is removed from the analysis in Figure 7b (and optimal 
parameters are re-chosen accordingly), the ordering is as follows 6 : 

S DDP > TIC e > max. corr. > HHG > MIC e > RDC > HSIC > I > dCor > MIC > MIC orig. param > ρ 

(See online supplement.) Finally, when the two largest outliers, a high-frequency sinusoid and a circle, are 
removed from the analysis, the ordering is as follows: 

S DDP > TIC e > dCor > max. corr. > MIC e > HHG > HSIC > RDC > MIC > I > MIC orig. param > ρ 

(See online supplement.) The orderings produced by these analyses are relatively robust to sample size and 
power threshold used, with TIC e or S DDP generally performing the best and occasionally swapping with 
each other as power threshold is varied. Results obtained with n = 100 and using 95%, 75%, 25%, and 10% 
power thresholds are provided in the supplemental materials. 

Several aspects of these rankings merit mention. First, state-of-the-art performance is shared between 
TIC e and Heller and Gorfine’s S DDP . This is interesting because the latter statistic is in fact closely 
related to the theory behind the maximal and total information coefficients in that it too is an aggregation 
via summation of mutual information scores taken over many different grids. Thus, these results provide 
evidence that the basic approach of aggregating mutual information scores over a large set of grids, whether 
via the characteristic matrix or other statistics, is a fundamentally promising avenue for thinking about 
dependence. 

Second, the average power of independence testing using MIC e , when parameters are optimized for the 
task of relationship detection rather than ranking, is competitive with the state of the art. In particular it 
is higher than the power of its predecessor MIC [Reshef et al., 2011], which estimates the same population 
quantity (MIC*). This demonstrates that the improved bias/variance properties of MIC e relative to MIC 
[Reshef et al., 2015a] indeed translate into an improvement in power. 

We note parenthetically that the power of the MIC statistic from Reshef et al. [2011] is substantially 
higher than has been previously reported. This discrepancy is due to the fact that previous analyses that 
examined the power of MIC used the default parameter setting (α = 0.6), which was intended to maximize 
equitability rather than power against independence. As this analysis shows, lower values of α should be 
used for testing for independence. As we show in Section 6, the same statement holds for MIC e , and both 
statements follow from a more general power-equitability trade-off. 

Our final — and perhaps most important — observation about our results is that the differences in power 
between most of the best-performing methods appear rather small. And indeed, an analysis using many of 
these methods on a real gene expression data set [Heller et al., 2014] shows that this observation is true in 
practice. For example, of the 3312 significant relationships found in the data set using a statistic related to 
S DDP , 3199 (97%) were also detected by HHG, and the latter found only 84 other relationships; 2845 (86%) 
were also detected by dCor on ranks, and the latter found only 44 other relationships; and 2445 (74%) were 
also detected even simply by computing the Pearson correlation coefficient on ranks. MIC e and TIC e were 
not run in this analysis, but the simulation results presented above lead us to believe that they would also 
have recovered a very similar set of relationships had their corresponding independence tests been used on 
this data set. 7 

parameters determined for the methods in this analysis were more similar to the parameters yielding optimal equitability. For 
this reason, we did not focus on this method for quantifying power against independence. 

6 We chose to remove outliers rather than use median power because the median is very sensitive to the performance 
of each method on only one or two particular function types. This is because a) the power values for different function 
types often rank in the same order across methods, and b) there are only eight such numbers and they each vary 
considerably among methods. 

7 MIC was run for this analysis, but with the default value of α = 0.6, which yields very poor power against independence. 

5.2.2 Worst-case power against independence across relationship types 

In addition to considering average-case power across relationship types, it is also important to examine 
worst-case performance. To measure this, we consider the lowest relationship strength x at which each 
independence test is guaranteed to detect, with a given amount of power, all relationships with strength at 
least x regardless of relationship type. When a small such x exists, the statistic in question is said to have a 
low detection threshold [Reshef et al., 2015b]. This implies that the corresponding independence test will not 
overlook important relationships because of the test statistic’s systematically assigning them lower scores. 
As described in Reshef et al. [2015b], low detection threshold is related to equitability: an equitable statistic 
provably has a low detection threshold on its set of standard relationships, whereas the converse is not true. 

The detection threshold of the independence tests we consider can be read from Figure 7b: if x is 
the maximum, across relationship types, of the R 2 required to achieve 50% power on each relationship type, 
then x is also the minimal R 2 such that we can guarantee at least 50% power on any relationship with an 
R 2 of x regardless of type. 
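
In code, this reading of the figure amounts to taking a maximum over relationship types. The sketch below is illustrative only: the function name is hypothetical, and any numbers passed to it in an example would be placeholders rather than results from this paper.

```python
def detection_threshold(min_r2_by_type):
    """Worst case, over relationship types, of the minimal R^2 needed for target power.

    `min_r2_by_type` maps a relationship-type name to the minimal R^2 at which
    the test reaches the chosen power (e.g. 50%), or to None if it never does.
    """
    values = list(min_r2_by_type.values())
    if not values or any(v is None for v in values):
        # No guarantee is possible if some relationship type never reaches the target.
        return None
    return max(values)
```

A statistic that needs a very small R 2 on linear relationships but a large R 2 on, say, a high-frequency sinusoid therefore has a high detection threshold, since the maximum is driven entirely by its weakest relationship type.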

As the figure shows, the detection thresholds of TIC e and MIC e on the set of relationships examined are an 
order of magnitude lower than the detection thresholds of the other statistics we evaluated. This phenomenon 
is robust to power thresholds besides 50%; see the online supplementary materials. It implies that TIC e is 
a good candidate for a “first-pass” filtering of the relationships in a data set before other, more fine-grained 
analyses are conducted. In contrast, the high detection thresholds of the other statistics imply that, for a 
fixed relationship strength, their power against independence may be more sensitive to relationship type. 
Using such statistics for pre-processing may therefore result in certain relationship types being missed in 
downstream analyses. 

5.2.3 Power on specific relationships 

Finally, to obtain a more fine-grained picture of the power of the methods we consider on specific relationship 
types, we also re-created the specific power analysis from Simon and Tibshirani [2012] with optimal parameter 
choices for each method, as above. The results are shown in Figure 8. Note that, in order to maximize our 
ability to discern between power curves generated by different tests within each relationship type, in this 
analysis we followed Simon and Tibshirani [2012] by plotting power as a function of absolute noise level 
rather than the population R 2 . This differs from the analyses above, and means that power levels are not 
directly comparable across relationship types. 

Similarly to our other results, the optimal parameter choices used here cause the power of tests based on 
several of the statistics included in this analysis to be better than previously reported [Simon and Tibshirani, 
2012; Gorfine et al., 2012; Lopez-Paz et al., 2013; Kinney and Atwal, 2014; Jiang et al., 2014]. For instance, 
we again see here that the power of MIC is substantially improved. We additionally see that the power of 
MIC e and TIC e is quite good across this set of relationships. This analysis also illustrates that each measure 
of dependence tested indeed has its own strengths and weaknesses. For example, distance correlation and 
HSIC are relatively better powered to detect linear dependence than MIC e and TIC e , but are relatively 
worse at simultaneously detecting most of the other forms of dependence tested. In contrast, S DDP appears 
to have a similar profile to that of TIC e , which again makes sense given the fact that S DDP , like TIC e , is 
also a grid-based method with a mutual information-based score that aggregates by summation. 

5.3 Discussion 

In this section we analyzed the power of independence tests based on several leading measures of dependence, 
including TIC e and MIC e , on the set of relationships chosen by Simon and Tibshirani [Simon and Tibshirani, 
2012]. Our analysis differs from previous ones in that we have aggregated results across relationship types, 
performed parameter sweeps for all the methods that have parameters, and examined a large set of methods 
and sample sizes. 

Our main finding is that TIC e , along with Heller and Gorfine’s S DDP , provides state-of-the-art performance 
on average over the relationship types examined. This is significant because TIC e is trivial to compute 
once MIC e has been computed: TIC e is the sum of the entries of a matrix whose maximal entry is MIC e . 8 

8 The parameter α of TIC e that leads to optimal power against independence may not equal the parameter α used for the 

Figure 8: A re-creation of the power analysis performed by Simon and Tibshirani [Simon and Tibshirani, 2012], 
with optimal parameter choices for each statistic. Power against a null hypothesis of statistical independence for the 
relationships examined in Simon and Tibshirani [2012], at 50 noise levels for each relationship and n = 500. For each 
statistic that has a parameter, an optimal value for the parameter was chosen as described in Figure C.1. (For a 
version with n = 100 see supplementary materials.) 

Moreover, the power of TIC e on individual relationship types remained high across relationship types; there 
was no one relationship type that testing for independence using TIC e would cause us to overlook with high 
probability. Our results therefore point to a promising and computationally efficient strategy for exploratory 
data analysis: first, simultaneously compute both MIC e and TIC e on all variable pairs in a data set. Then 
discard pairs declared insignificant by TIC e and examine the MIC e scores of the remaining pairs. This way, 
the multiple-testing burden is borne by the state-of-the-art power of TIC e , but the significant relationships 
can still be ranked equitably using MIC e . We remark that using S DDP together with MIC e in an analogous 
strategy would not be optimal for two reasons. First, such a strategy would be slower, both because S DDP 
must be computed independently of MIC e whereas TIC e need not be, and because S DDP itself is slower 
to compute than MIC e /TIC e . (See Section 7 for more on running times.) Second, since the power against 
independence of S DDP appears more sensitive to alternative hypothesis relationship type, it seems that filtering 
relationships by S DDP is more likely to result in important relationships being eliminated prematurely 
because of their relationship type. 
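
The filter-then-rank strategy described in this paragraph can be sketched as follows. This is a hedged illustration under several assumptions. The two statistics are passed in as callables because we do not assume any particular software implementation of MIC e or TIC e . Significance is assessed with a permutation test, and the multiple-testing correction is Bonferroni. None of these choices is prescribed by the text, and all names are hypothetical.

```python
import itertools

import numpy as np


def abs_pearson(x, y):
    """Placeholder statistic for demonstration only (not MIC_e or TIC_e)."""
    return abs(float(np.corrcoef(x, y)[0, 1]))


def filter_then_rank(data, filter_stat, rank_stat, n_perm=200, level=0.05, seed=0):
    """Keep pairs that filter_stat declares significant, then rank them by rank_stat.

    `data` maps variable names to equal-length 1-D arrays. Returns a list of
    (score, name_a, name_b) tuples sorted from highest to lowest score.
    """
    rng = np.random.default_rng(seed)
    pairs = list(itertools.combinations(data, 2))
    kept = []
    for a, b in pairs:
        x, y = data[a], data[b]
        observed = filter_stat(x, y)
        # Permutation null: shuffling y breaks dependence but keeps the marginals.
        null = np.array([filter_stat(x, rng.permutation(y)) for _ in range(n_perm)])
        p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
        # Bonferroni correction across all pairs (an assumption of this sketch).
        if p_value <= level / len(pairs):
            kept.append((a, b))
    ranked = [(rank_stat(data[a], data[b]), a, b) for a, b in kept]
    return sorted(ranked, reverse=True)
```

In a real analysis `filter_stat` would be TIC e and `rank_stat` would be MIC e , ideally read off the same matrix so that the second statistic costs nothing extra. Note that with Bonferroni correction the number of permutations must exceed the number of pairs divided by the significance level, or no pair can pass the filter.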

Our analysis also showed that the power against independence of tests based on MIC e is greater than 
that of tests based on its predecessor, MIC, and in particular that MIC e yields tests with power close to the 
state of the art. However, these results require a setting of the parameter α of MIC e that differs from that 
used for optimal equitability, suggesting a trade-off between power against independence and equitability 
that we study in the following section. Additionally, we found that the power against independence of most 
of the methods tested varies considerably across different alternative hypothesis function types, whereas this 
sensitivity is substantially weaker for MIC e and TIC e . 

Finally, we observed that, at least in the bivariate setting, the performance of many of the leading 
methods appears quite similar, even on real data. This last observation leads us to question whether the 
magnitude of a method’s power against independence ought to be the only measure of that method’s utility. 
There are cases in which the answer is ‘yes’, such as when we wish to perform an independence test between 
two high-dimensional variables whose result is the end-goal of our analysis. However, in data exploration 
scenarios in which existing measures of dependence already reliably identify thousands of relationships, it 
may be more important to be able to prioritize those relationships for follow-up, rather than to discover 
a small number of additional relationships whose strength, and therefore scientific promise, is uncertain. 
Solving the data exploration problem well requires us not just to maximize the number of relationships we 
detect, but also to think about how the statistic we choose to use will influence which relationships we find. 
Indeed, this issue is what inspired the original work on MIC and equitability [Reshef et al., 2011], but we 
believe the questions regarding the right frameworks for understanding data exploration problems continue 
to pose numerous interesting challenges. 

6 The power-equitability trade-off and parameter choice for MIC e 

The above analyses establish that MIC e can be both highly equitable and provide high-powered tests for 
detecting deviations from independence. However, in each analysis the parameter α of MIC e was chosen 
to optimize the objective in question, and the parameter value that yields optimal equitability is different 
from the value that yields optimal power against independence. This suggests that there may be a trade-off 
between these two objectives that is being captured by the choice of this parameter [Reshef et al., 2013]. 

Such a trade-off also seems plausible given the equivalence proven in Reshef et al. [2015b] between 
equitability and power against a range of null hypotheses corresponding to different relationship strengths. 
After all, if equitability is about simultaneously achieving high power against many null hypotheses, then 
“no free lunch”-type considerations imply that to attain this objective we may have to give up some of the 
power we previously had against the specific null hypothesis of independence. 

Here we establish that such a trade-off does indeed appear to exist within each of the parametrized 
methods we consider. We then discuss the implications of this trade-off for how one should choose parameters 
when using MIC e in practice. 

computation of MIC e if, for instance, the latter is being computed with equitability as a goal. In this case, the total runtime will 
equal the runtime of the method with the greater value of α, since increasing α just grows the portion of the equicharacteristic 
matrix that is computed. In most situations, we expect that the value of α desired for MIC e will be greater than that desired 
for TIC e since the former will be run with equitability in mind, and so TIC e will be a trivial side-product of the computation 
of MIC e . 

Figure 9: The trade-off between equitability and power against statistical independence across methods. For each 
method, average power as quantified in Figure 7a is plotted as well as the worst-case equitability under the same 
model, with n = 500. For every parametrized method, a point is plotted for each value of the parameter in question. 
The points corresponding to MIC e are emphasized. Since each coordinate is strictly preferable to all coordinates 
below and to the left of it, there is a Pareto “power-equitability” front. The methods with points along this front are 
MIC e , maximal correlation, TIC e , and S DDP . 


6.1 Demonstrating the power-equitability trade-off 

We examined the equitability and power against independence of MIC e for values of α ranging from 0.25 to 
0.9, at a sample size of 500. By plotting worst-case equitability against average power for each value of α, 
we sought to understand whether there is a Pareto front of equitability/power beyond which we cannot seem 
to advance. The existence of such a boundary would support the existence of a power-equitability trade-off. 
We performed a similar analysis for all of the statistics whose power against independence and equitability 
we assessed. 

Figure 9 shows that every parametrized method with a non-trivial level of equitability does indeed 
exhibit such a trade-off. In the case of MIC e , the trade-off is captured by the parameter α, which controls 
the maximal grid resolution used by the statistic. This is consistent with the bias-variance analysis in 
Reshef et al. [2015a], which showed that low values of α lead to better performance in the low-signal regime 
while larger values of α lead to better performance in mid-to-high-signal regimes. It is also consistent with 
the intuition that disallowing high-resolution grids may increase power against independence but will allow 
only coarse-grained distinguishability among distributions, while allowing high-resolution grids might enable 
distinguishing between distributions that may be more similar to each other. 

Figure 9 is also a useful summary of how the different methods we considered compare to each other 
along these two dimensions (for this sample size and set of relationships). Specifically, if one point is both 
above and to the right of another then it is strictly preferable. Thus, the figure shows a Pareto front of 
methods that offer optimal performance with respect to power against independence and equitability. This 
front includes MIC e , maximal correlation, TIC e , and S DDP . 
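
The notion of a Pareto front used here can be made concrete with a short sketch. The function below is illustrative only. It takes hypothetical (label, power, equitability) triples, where higher is better on both axes, and keeps every point that no other point dominates. It uses the standard weak-dominance convention (at least as good on both axes and strictly better on one), which is slightly broader than "both above and to the right".

```python
def pareto_front(points):
    """Return the (label, power, equitability) triples not dominated by any other."""
    front = []
    for label, power, equit in points:
        dominated = any(
            # Another point is at least as good on both axes...
            (p2 >= power and e2 >= equit)
            # ...and strictly better on at least one of them.
            and (p2 > power or e2 > equit)
            for _, p2, e2 in points
        )
        if not dominated:
            front.append((label, power, equit))
    return front
```

Applied to one point per method and parameter value, as in Figure 9, the surviving labels are the methods that lie on the power-equitability front.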

6.2 Choosing parameters for MIC e /TIC e : a practical guide 

We now give some guidelines for setting parameters for MIC e /TIC e more generally. The two parameters 
required by these statistics are the parameter α discussed above, which governs the maximal grid resolution 
B(n) of the estimator according to B(n) = n^α, and c, an optional parameter that controls a speed-versus-optimality 
trade-off in the algorithm. We discuss each of these in turn. 

6.2.1 Choosing α 

There are two main considerations involved in choosing α. The first, which is suggested by the analysis above, 
is how much we care about power against independence relative to equitability (i.e., power at distinguishing 
cleaner relationships from noisier relationships). The second consideration is whether we expect to see 
complex relationships in our data. These considerations can be reframed in terms of hypothesis testing 
as follows: 

1. Is our null hypothesis statistical independence or presence of a weak dependence? 

It follows from the above analysis that when using MIC e (or, more likely, TIC e ) to generate tests for 
statistical dependence one should use a lower value of α, while if one is interested in equitability, a 
larger α is required. 

2. What is our most complex alternative hypothesis? 

Since α places an upper bound on the resolution of grids that can be explored by the estimators, it 
restricts the complexity of structure that can be detected. Thus, as the relationship class of interest 
grows to include more complex structure relative to sample size, the value of α should be increased 
accordingly. 

Balancing the two considerations. For the specific values of a that maximized power against independence 
of TIC e and equitability of MIC e , respectively, in our analyses, see Appendix E. The tables generally 
show that a) when optimizing for statistical power against independence in the sample-size regimes analyzed 
here, one should use an a that leads to B(n) being approximately between 4 (for less complex alternative 
hypotheses) and 12 (for more complex alternative hypotheses; see footnote 9), and b) when optimizing for 
equitability, one should use an a approximately between 0.5 (when n is larger) and 0.75 (when n is smaller). 
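
As a small worked example (ours, not taken from Appendix E), the relation B(n) = n^a can be inverted to find the exponent a that yields a desired grid resolution at a given sample size. The sample size n = 500 below is chosen purely for illustration.

```latex
\[
B(n) = n^{a} \quad\Longleftrightarrow\quad a = \frac{\ln B(n)}{\ln n}
\]
\[
n = 500:\qquad
B(n) = 4 \;\Rightarrow\; a = \frac{\ln 4}{\ln 500} \approx 0.22,
\qquad
B(n) = 12 \;\Rightarrow\; a = \frac{\ln 12}{\ln 500} \approx 0.40
\]
```

Under this illustrative choice of n, the power-oriented range of B(n) between 4 and 12 corresponds to exponents well below the 0.5 to 0.75 range suggested for equitability.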

Equitability and computational efficiency. For large n, the parameters suggested above for equitability 
are likely needlessly computationally expensive. This is because as n grows, the maximal allowed grid 
resolution of the statistic B(n) = n^a will outstrip the complexity of most alternative hypotheses that we are 
liable to encounter in practice. 

For example, at n = 5,000, B(n) = 70 provides good equitability on the set of functions and noise models 
tested in this paper. If this level of equitability is acceptable to us, we may set a = log_n 70 for n > 5,000, 
which means that B(n) = 70 always. Given that the runtime of the search procedure in MIC e is O(n^(5a/2)), 
which is O(n) for a = 0.4, a less extreme version of this strategy that maintains consistency and gives 
asymptotically linear runtime is to allow a to decrease for large n until a = 0.4 is reached, and then to keep 
it at 0.4. In the example above, this happens around n = 40,000. And indeed, the equitability of MIC e at 
this sample size with a = 0.4 appears quite good. 
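
The schedule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the default arguments, and the decision to reject n below the reference sample size are ours and do not come from any MIC e implementation.

```python
import math


def capped_exponent(n, n_ref=5000, b_ref=70.0, a_floor=0.4):
    """Exponent a for B(n) = n**a under the schedule sketched in the text.

    Hold B(n) at b_ref (a = log_n b_ref) for n >= n_ref, but never let a
    drop below a_floor. All names and defaults here are illustrative.
    """
    if n < n_ref:
        raise ValueError("the schedule is only described for n >= n_ref")
    a_fixed_resolution = math.log(b_ref) / math.log(n)  # a = log_n(b_ref)
    return max(a_fixed_resolution, a_floor)


def crossover_sample_size(b_ref=70.0, a_floor=0.4):
    """Sample size at which log_n(b_ref) reaches a_floor: n = b_ref**(1/a_floor)."""
    return b_ref ** (1.0 / a_floor)


def grid_resolution(n, a):
    """Maximal grid resolution B(n) = n**a."""
    return n ** a


a_5k = capped_exponent(5000)        # just under 0.5
a_10k = capped_exponent(10000)      # still above the floor
a_1m = capped_exponent(10 ** 6)     # floor of 0.4 applies
n_star = crossover_sample_size()    # about 41,000
```

With these defaults the crossover is 70^2.5, which is consistent with the "around n = 40,000" figure quoted above.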

For more on how to balance runtime and equitability, see Figure D.1, which graphs equitability on our set 
of functional relationships against runtime as a and n are varied, as well as Table E.4, which suggests values 
of a at several sample sizes that yield 80% of the best observed equitability for MIC e at each sample size, 
and the discussion in the next section, where we examine the runtime of MIC e compared to other statistics. 

6.2.2 Choosing c 

The parameter c determines the coarseness of the discretization of the grid search in the algorithm that 
computes MIC e , with larger values of c corresponding to finer discretization [Reshef et al., 2015a]. 
Characterizing the effect of c on the bias and variance of MIC e is an important avenue of future work. However, 

9 Of course, for even more complex alternative hypotheses, a larger B(n) will lead to better performance, provided the 
sample size allows for detection of the level of complexity in question. In particular, we suspect that B(n) = ω(1) is necessary 
for consistency against all alternatives of the resulting independence test. Note however that this hypothesis applies only to 
MIC e /TIC e and not to MIC/TIC, because even just estimating the first entry M(X,Y)_{2,2} of the population characteristic 
matrix yields a statistic that is consistent against all alternatives. (See, e.g., Lemma 6.7 in the supplemental online materials 
of Reshef et al. [2011].) 





Sample Size   P        Max. Corr.   RDC      dCor      HSIC      HHG
50            0.0001   0.0004       0.0015   0.0010    0.0016    0.0017
100           0.0001   0.0005       0.0014   0.0014    0.0032    0.0063
500           0.0001   0.0014       0.0023   0.0504    0.0847    0.2185
1,000         0.0002   0.0025       0.0035   0.3518    0.4886    1.0956
5,000         0.0002   0.0119       0.0129   6.1402    6.5975    34.0171
10,000        0.0002   0.0239       0.0251   25.9859   25.7333   465.3222

Sample Size   MIC e [P]   MIC e [FE]   MIC e [E]   MIC       I (Kraskov)   S DDP (m = 3)
50            0.0004      0.0009       0.0021      0.0015    0.0096        0.0010
100           0.0005      0.0012       0.0052      0.0061    0.0100        0.0023
500           0.0018      0.0079       0.1630      0.2187    0.0122        0.0529
1,000         0.0037      0.0172       0.1992      0.9628    0.0150        0.2122
5,000         0.0195      0.0974       0.3398      18.7627   0.0427        5.7464
10,000        0.0398      0.1819       0.6835      66.2238   0.0927        23.4473


Table 2: Average runtimes, in seconds, of algorithms for computing measures of dependence over 100 trials of 
uniformly distributed, independent samples at a range of sample sizes. Results for MIC e are presented for three 
sample-size-dependent parameter settings that optimize for maximal power against independence ([P]), 99% of optimal 
equitability ([E]), and 80% of optimal equitability (fast equitability, [FE]). For a list of the parameters used in each 
of these settings, see Table E.4. TIC e is omitted because its runtime is very similar to that of MIC e [P]. In this analysis, 
the Kraskov mutual information estimator was run using a pre-compiled C binary, MIC was computed approximately 
using the APPROX-MIC algorithm [Reshef et al., 2011] in Java, and MIC e was run in Java. The other statistics 
were run using their respective R functions/packages. Note that dCor was run with the standard R package, which 
is O(n^2); as of this writing there is a faster estimator of the same population quantity that is computable in time 
O(n log n) [Huo and Szekely, 2014]. 


using c = 5 seems to provide good performance in most settings, and in more computationally constrained 
settings even c = 1 appears to result in only moderate performance loss [Reshef et al., 2015a]. 


7 Runtime analysis 

Computational efficiency is often desirable when evaluating dependence, and here we assess the runtimes 
associated with the set of measures of dependence examined. 

7.1 Setting up the analysis 

Since the runtime of MIC e /TIC e depends on parameter choice, results for MIC e are presented for parameter 
settings recommended for maximizing equitability, maximizing power against independence, and attaining 
reasonable equitability on a limited computational budget. The third set of parameters was computed by 
searching at each sample size for the parameters that resulted in the fastest runtime while still yielding 80% 
of the best observed equitability at that sample size. All the parameters used for MIC e /TIC e in this analysis 
are detailed in Table E.4. 
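
The selection rule for the third parameter set can be sketched as follows. This is a minimal illustration, not the procedure used to produce Table E.4: the `Candidate` record, the assumption that equitability is summarized by a single score where higher is better, and all numeric values are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    a: float             # exponent in B(n) = n**a
    c: float             # discretization parameter
    runtime_s: float     # measured runtime in seconds (hypothetical values below)
    equitability: float  # hypothetical summary score; higher means more equitable


def fastest_within_fraction(candidates, fraction=0.8):
    """Return the fastest candidate whose score is at least `fraction` of the best score."""
    if not candidates:
        raise ValueError("no candidates supplied")
    best = max(cand.equitability for cand in candidates)
    eligible = [cand for cand in candidates if cand.equitability >= fraction * best]
    return min(eligible, key=lambda cand: cand.runtime_s)


# Hypothetical grid of settings at one sample size (values invented for illustration).
grid = [
    Candidate(a=0.50, c=5, runtime_s=0.40, equitability=0.90),
    Candidate(a=0.45, c=5, runtime_s=0.20, equitability=0.80),
    Candidate(a=0.40, c=1, runtime_s=0.05, equitability=0.74),
    Candidate(a=0.30, c=1, runtime_s=0.02, equitability=0.50),
]
chosen = fastest_within_fraction(grid, fraction=0.8)
```

Here the fastest setting overall is rejected because its score falls below 80% of the best one, and the next-fastest setting is selected instead.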

The only other method whose runtime is affected by its parameter was S DDP . Since S DDP did not 
achieve non-trivial levels of equitability, we set its parameter to the value that maximized power against 
independence (see footnote 10). For statistics whose runtimes did not depend on parameter choice, defaults 
were used (see Appendix E). 

10 Since the runtime of S DDP as a function of its integer-valued parameter m is O(n^(m-1)) for m = 2, 3, 4, the choice of m 
heavily affects the runtime. This is significant because the parameter setting that maximizes power against independence can be 
computed in different ways that lead to different values of m: when power is measured by average area under the power curve, 
m = 3 performed the best with m = 2 a close second; in contrast, when power is measured via the minimum R^2 necessary to 
achieve a certain level of power, m = 4 was the best with m = 3 and m = 2 performing significantly worse. (See Section 5.1.4 
for a description of these methods of quantifying power.) We therefore have chosen m = 3 here. However, the correct choice of 
parameter for this statistic will likely depend on the use-case and the available computational budget. For the performance of 
S DDP at other values of m, see the online supplement. 

7.2 Results 


The results of our runtime analysis, found in Table 2, show several things. First, MIC e with all three of the 
parameter settings given is substantially faster than the previously introduced MIC statistic from Reshef 
et al. [2011] run using default parameters. This matches the theoretical analysis in Reshef et al. [2015a], 
which shows that the complexity of the search procedure in MIC e is O(n^(5a/2)) whereas the complexity of 
the search procedure in the APPROX-MIC algorithm used to compute MIC is O(n^(4a)). Second, even when 
equitability is prioritized, the runtime of MIC e is comparable with or faster than that of most of the other 
leading measures of dependence. The two exceptions to this are RDC and maximal correlation, which are 
both quite fast even at very large sample sizes. 
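
As a rough check on the scaling visible in Table 2, one can estimate an empirical exponent from the last two sample sizes by taking the slope on a log-log scale. The sketch below is ours; it uses only the runtimes reported in Table 2 at n = 5,000 and n = 10,000, and the dictionary keys are just labels. For MIC e the parameter a itself changes with n (see Table E.4), so its slope reflects the parameter schedule as well as the algorithm and should not be read as a pure complexity exponent.

```python
import math

# Runtimes in seconds from Table 2 at n = 5,000 and n = 10,000.
RUNTIMES = {
    "dCor": (6.1402, 25.9859),
    "HSIC": (6.5975, 25.7333),
    "HHG": (34.0171, 465.3222),
    "S_DDP (m=3)": (5.7464, 23.4473),
    "MIC": (18.7627, 66.2238),
    "MIC_e [E]": (0.3398, 0.6835),
}


def empirical_exponent(t_small, t_large, n_small=5000, n_large=10000):
    """Slope of log(runtime) against log(n) between two sample sizes."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)


EXPONENTS = {name: empirical_exponent(*times) for name, times in RUNTIMES.items()}

for name, k in sorted(EXPONENTS.items(), key=lambda item: item[1]):
    print(f"{name:12s} runtime grows roughly like n^{k:.2f}")
```

The slopes for dCor and S DDP with m = 3 come out close to 2, in line with the O(n^2) behavior stated for both in the text, while the slope for MIC e under the equitability-optimized settings is close to 1 over this range.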

We note one interesting feature of the runtime of MIC e . Since estimating MIC* involves a search procedure, 
runtimes for estimating it are substantially faster when data contain less noise; as such, the runtimes on 
statistically independent data presented in Table 2 represent worst-case performance. When run on data drawn 
from a noiseless linear relationship at the same sample sizes, MIC e ran 10%-75% faster across the range of 
sample sizes tested when using settings that optimize for equitability, 5%-50% faster across the sample sizes 
tested when using settings intended to achieve equitability on a limited computational budget, and 10%-30% 
faster across the sample sizes tested when using settings that optimize for power against independence. The 
runtime of S DDP exhibited a similar phenomenon, but the runtimes of the other methods were insensitive 
to the level of structure present and did not exhibit this effect. 

7.3 Discussion 

In this section we analyzed the runtimes of MIC e /TIC e alongside other leading measures of dependence at 
sample sizes ranging from 50 to 10,000. Our main finding is that MIC e /TIC e is faster than or comparable to 
most of the other methods tested, and is much faster than its predecessor MIC. Specifically, with parameters 
chosen to yield state-of-the-art power for TIC e and approximately 80% of the best achievable equitability 
for MIC e , both statistics can be computed on a sample size of 5,000 in 97 milliseconds. For a data set 
with n = 5,000 consisting of 1,000 variables, this translates into a total runtime of 8.1 minutes to compute 
both statistics for all variable pairs on a cluster with 100 nodes. These numbers imply that analysis of even 
relatively large data sets is possible using MIC e and TIC e . 
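
The arithmetic behind the 8.1-minute figure can be reproduced directly. The sketch below assumes, as the text implicitly does, that the work is divided perfectly evenly across nodes with no scheduling or I/O overhead; the variable names are ours.

```python
from math import comb

n_variables = 1000
seconds_per_pair = 0.0974  # MIC e [FE] at n = 5,000 from Table 2 (about 97 ms)
n_nodes = 100

n_pairs = comb(n_variables, 2)                  # number of unordered variable pairs
total_cpu_seconds = n_pairs * seconds_per_pair  # single-node cost
wall_clock_minutes = total_cpu_seconds / n_nodes / 60.0

print(f"{n_pairs:,} pairs, {total_cpu_seconds / 3600:.1f} CPU-hours, "
      f"{wall_clock_minutes:.1f} minutes on {n_nodes} nodes")
```

Because the number of pairs grows quadratically in the number of variables, doubling the variable count roughly quadruples this estimate.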

We emphasize that our results represent a snapshot based on currently available implementations. Just as 
MIC e has provided an improvement over APPROX-MIC, and just as recent advances are providing ways for 
estimating distance correlation in time O(n log n) rather than O(n^2), we expect that with time algorithmic 
improvements will allow for more efficient computation of some of the newer methods analyzed here. 

8 Conclusion 

In this paper, we presented an in-depth empirical evaluation of the equitability, power against independence, 
and runtime of several leading measures of dependence, including two new statistics introduced in Reshef 
et al. [2015a]. Our aims were to give an accessible exposition of equitability and its relationship to power 
against independence, provide the community with a comprehensive and rigorous side-by-side comparison 
of existing methods, and evaluate the new statistics against the existing state of the art. Our main findings 
were as follows. 

1. Equitability. MIC e , the estimator of the population MIC introduced in Reshef et al. [2015a], generally 
has superior and more robust equitability with respect to R^2 than other measures of dependence. 
In some specific settings (models with no noise in the independent variable and n = 5,000), mutual 
information estimation achieves superior equitability in our experiments, but its equitability is otherwise 
highly variable and often poor, particularly at lower sample sizes. Maximal correlation achieves 
some degree of equitability over the models examined, but all other statistics tested have very poor 
equitability. 

2. Power against independence. TIC e , a statistic introduced in Reshef et al. [2015a], shares state-of-the-art 
power against independence with Heller and Gorfine’s S DDP , with both methods generally performing 
very well and alternately outperforming each other in different settings. MIC e also has power against 
independence that is competitive with the state of the art, albeit under parameter settings that differ 
from those that confer good equitability. Moreover, the power of independence testing using TIC e 
and MIC e is much less sensitive than that of the other methods examined to alternative hypothesis 
relationship type. The original statistic MIC has substantially higher power against independence than 
has been reported in previous analyses when a different parameter setting is used. Finally, distance 
correlation, maximal correlation, HHG, HSIC, and RDC also had good power against independence. 

3. Power/equitability tradeoff. The parameter a in the estimator MIC e corresponds to a trade-off between 
power against independence and equitability that is consistent with the characterization of equitability 
given in Reshef et al. [2015b]. Lower values of a lead to higher power against a null of independence at 
the expense of power against null hypotheses representing weak relationship strength (i.e., equitability), 
while higher values of a lead to better equitability at the expense of power against independence. 

4. Runtime. MIC e and TIC e , each of which can be trivially computed once the other has been obtained, 
have runtimes that allow them to be run together even on large samples in reasonable time. This 
runtime compares favorably with that of other complex measures of dependence such as S DDP , dCor, 
HSIC, and HHG. The fastest measures of dependence were maximal correlation and the randomized 
dependence coefficient. There is a large variety of runtimes across the measures of dependence 
examined. 

There are several important takeaways from our results. First, they suggest that using MIC e and TIC e 
in tandem to filter relationships and rank them by strength is a statistically sound and computationally 
efficient strategy for exploratory data analysis. In particular, one can imagine a system in which first TIC e 
is computed for all relationships and only the significant ones are kept, and then MIC e with equitability- 
optimized parameters is examined only for the latter set. Since TIC e enjoys high power against independence 
on a wide range of alternative hypothesis relationship types, pre-filtering with TIC e in this way will not result 
in important relationships being overlooked due to their relationship type. Any measure of dependence 
deemed to have sufficient power on a broad range of alternative hypotheses can be substituted for TIC e . 
However, since TIC e and MIC e can be computed simultaneously, and since TIC e offers state-of-the-art power 
against independence, using TIC e appears to be a preferable choice in such a scenario. 
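
The filter-then-rank strategy can be sketched generically. The code below is only an illustration of the workflow: it does not compute TIC e or MIC e . Instead it accepts any two statistics as callables and uses absolute Pearson correlation as a stand-in for both; the function names, the permutation test, and the toy data are all our assumptions, and no multiple-testing correction is applied.

```python
import itertools
import math
import random


def abs_pearson(x, y):
    """Stand-in statistic: absolute Pearson correlation (NOT TIC_e or MIC_e)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def permutation_p_value(x, y, stat, n_perm, rng):
    """One-sided permutation p-value under the null of independence."""
    observed = stat(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if stat(x, y_perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def screen_then_rank(columns, test_stat, rank_stat, level=0.05, n_perm=199, seed=0):
    """Keep pairs that test_stat finds significant, then sort them by rank_stat."""
    rng = random.Random(seed)
    kept = []
    for i, j in itertools.combinations(range(len(columns)), 2):
        p = permutation_p_value(columns[i], columns[j], test_stat, n_perm, rng)
        if p <= level:
            kept.append((rank_stat(columns[i], columns[j]), i, j, p))
    kept.sort(reverse=True)  # strongest relationships first
    return kept


# Toy data: column 1 depends on column 0; column 2 is independent noise.
_rng = random.Random(1)
x = [_rng.random() for _ in range(200)]
y = [v + _rng.gauss(0.0, 0.05) for v in x]
z = [_rng.random() for _ in range(200)]
ranked = screen_then_rank([x, y, z], abs_pearson, abs_pearson)
```

In an actual analysis one would pass a power-oriented statistic such as TIC e as `test_stat` and an equitability-oriented statistic such as MIC e as `rank_stat`, and would control for the number of pairs tested.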

Second, the fact that many measures of dependence performed similarly in our analysis of power against 
independence, as well as in analyses of real data sets that others have performed (see, e.g., Heller et al. 
[2014]), suggests that power against independence may not be where the true challenge lies for bivariate 
relationships, and that we ought to demand more of the measures of dependence that we use in this setting. 
Equitability is one attempt to formulate a more ambitious goal, as is the concept of low detection threshold 
introduced in Reshef et al. [2015b], but there may well be other possibilities. Of course, for higher-dimensional 
relationships, even just power against independence is very difficult to achieve, and many of the methods 
evaluated here are quite useful in that setting. 

Finally, the comprehensiveness of our results provides significant understanding of the comparative 
performance of various measures. To our knowledge, our analyses are the most exhaustive to date in that they 
evaluate a large swath of measures of dependence side-by-side along a number of dimensions (equitability, 
power against independence, and runtime); over a wide range of models, relationship types, and sample 
sizes; and with parameters that are optimized for each individual statistic in each analysis. Our hope is that 
the full set of results, which can be downloaded in bulk at 
http://www.exploredata.net/ftp/empirical_supplement.zip, will be a resource to the community that enables 
more consistent, direct comparisons between different measures of dependence, and facilitates a precise 
discussion of the trade-offs and assumptions associated with each one in various settings. 

While the results presented here make a compelling case for the use of MIC e and TIC e and provide insight 
into the trade-offs between different measures of dependence, there are some important limitations for both 
the new statistics and the comparisons we performed. First, in this paper we evaluated only equitability 
with respect to R^2 on noisy functional relationships, whereas the definition we give of equitability explicitly 
acknowledges the possibility of using other properties of interest besides R^2 and standard relationships that 
are not noisy functional relationships. We feel that R^2 is an important property of interest that is intuitive 
and familiar to many practitioners, but equitability with respect to other properties of interest merits study 
as well, and the methods tested here may perform much better or worse when their equitability is evaluated 
with respect to other properties of interest. 

Additionally, though an attempt at comprehensiveness was made, we did limit our scope to the set of 
noisy functional relationships in Reshef et al. [2011] for equitability and the relationships introduced in Simon 
and Tibshirani [2012] for power against independence. While we feel each of these suites of relationships 
provides reasonable insight into the performance of the methods in question, there are relationships that, 
when added to these suites, result in extremely poor performance for all the methods tested. Characterizing 
those relationships theoretically in both the setting of equitability and that of power against independence 
is important if we are to fully understand the strengths and weaknesses of each of these methods. This is an 
important direction for future work. 

Measures of dependence are useful in a variety of settings and identifying which measures of dependence 
provide superior performance in the face of different objectives, assumptions, and constraints is critical. For 
each separate goal, we must understand both which measure of dependence is most appropriate and also 
which parameter settings lead to the best performance. Such an understanding provides insight into the 
inherent trade-offs of different methods, allowing us to navigate the landscape of measures of dependence 
more effectively and, ultimately, to better understand our data. 

A Data generation 

A.1 Definitions of functions used 

Tables A.1 and A.2 contain the definitions of the functions used to assess the equitability and statistical 
power against independence, respectively, of measures of dependence throughout this paper. The functions 
used for all analyses of power against independence (Table A.2) are taken from Simon and Tibshirani [2012]. 


#    Function Name                    Definition                                         Domain
1    Cosine, High Freq                y = cos(14πx)                                      x ∈ [0, 1]
2    Cosine, Non-Fourier Freq [Low]   y = cos(7πx)                                       x ∈ [0, 1]
3    Cosine, Varying Freq [Medium]    y = sin(5πx(1 + x))                                x ∈ [0, 1]
4    Cubic                            y = 4x^3 + x^2 - 4x                                x ∈ [-1.3, 1.1]
5    Cubic, Y-stretched               y = 41(4x^3 + x^2 - 4x)                            x ∈ [-1.3, 1.1]
6    Exponential [10^x]               y = 10^x                                           x ∈ [0, 10]
7    Exponential [2^x]                y = 2^x                                            x ∈ [0, 10]
8    L-shaped                         y = x/99 if x ≤ 99/100;
                                      y = 1 if x > 99/100                                x ∈ [0, 1]
9    Line                             y = x                                              x ∈ [0, 1]
10   Linear+Periodic, High Freq       y = (1/10) sin(10.6(2x - 1)) + (11/10)(2x - 1)     x ∈ [0, 1]
11   Linear+Periodic, High Freq 2     y = (1/5) sin(10.6(2x - 1)) + (11/10)(2x - 1)      x ∈ [0, 1]
12   Linear+Periodic, Low Freq        y = (1/5) sin(4(2x - 1)) + (11/10)(2x - 1)         x ∈ [0, 1]
13   Linear+Periodic, Medium Freq     y = sin(10πx) + x                                  x ∈ [0, 1]
14   Lopsided L-shaped                y = 200x if x < 1/200;
                                      y = -198x + 199/100 if 1/200 ≤ x < 1/100;
                                      y = -x/99 + 1/99 if x ≥ 1/100                      x ∈ [0, 1]
15   Parabola                         y = 4x^2                                           x ∈ [-1/2, 1/2]
16   Sigmoid                          y = 0 if x < 49/100;
                                      y = 50(x - 1/2) + 1/2 if 49/100 ≤ x ≤ 51/100;
                                      y = 1 if x > 51/100                                x ∈ [0, 1]
17   Sine, High Freq                  y = sin(16πx)                                      x ∈ [0, 1]
18   Sine, Low Freq                   y = sin(8πx)                                       x ∈ [0, 1]
19   Sine, Non-Fourier Freq [Low]     y = sin(9πx)                                       x ∈ [0, 1]
20   Sine, Varying Freq [Medium]      y = sin(6πx(1 + x))                                x ∈ [0, 1]
21   Spike                            y = 20x if x < 1/20;
                                      y = -18x + 19/10 if 1/20 ≤ x < 1/10;
                                      y = -x/9 + 1/9 if x ≥ 1/10                         x ∈ [0, 1]

Table A.1: Definitions of the functions used to analyze equitability. Under noise/sampling models containing noise 
in the independent variable or independent-variable marginal distributions other than E_f(X) or U_f(X), functions 6, 
8, 14, 16, and 21 were excluded due to poor performance across all methods tested. This is presumably due to the 
fact that a) horizontally perturbing points in a very steep portion of a function drastically changes the distribution in 
question, and b) sampling uniformly along the x-axis drastically under-samples a large part of the graph of a function 
if that graph contains very steep portions. 


Function Name          Definition                                              Domain
Line                   y = x                                                   x ∈ [0, 1]
Quadratic              y = 4x^2                                                x ∈ [-1/2, 1/2]
Cubic                  y = 128(x - 1/3)^3 - 48(x - 1/3)^2 - 12(x - 1/3)        x ∈ [0, 1]
Sinusoid (8 periods)   y = sin(16πx)                                           x ∈ [0, 1]
Sinusoid (2 periods)   y = sin(4πx)                                            x ∈ [0, 1]
x^(1/4)                y = x^(1/4)                                             x ∈ [0, 1]
Circle                 y = ± sqrt(1 - (2x - 1)^2)                              x ∈ [0, 1]
Step                   y = 0 if x ≤ 1/2; y = 1 if x > 1/2                      x ∈ [0, 1]

Table A.2: Definitions of the functions from Simon and Tibshirani [2012] used to analyze statistical power against 
independence. 

A.2 Generating a sample from a distribution with a specified R^2 

Given a noisy functional relationship of the form (X + ε, f(X) + ε'), the R^2 of the relationship is the 
squared correlation between f(X + ε) and f(X) + ε'. Many of the equitability and power analyses performed in this 
work require the ability to set ε and ε' such that the resulting distribution has a given population R^2. 

In the case that ε = 0 and the variance of ε' is known, the R^2 of the distribution has a closed-form 
expression given in Reshef et al. [2011]. If we specialize that expression to the case we consider in this paper, 
wherein ε' ~ N(0, σ^2), and then solve for σ, we obtain the following expression. 

\[ \sigma(R^2) = \sqrt{ \mathrm{var}(f(X)) \left( \frac{1}{R^2} - 1 \right) } \]

In cases that include noise in the independent variable, we set ε (and ε' if the noise model requires) by 
binary search, using the sample R^2 of a very large sample as an estimate of the population R^2. 
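
Both cases can be sketched in Python. This is a minimal illustration rather than the code used for the paper: the function names are ours, the closed form is the one that follows from R^2 = var(f(X)) / (var(f(X)) + σ^2), and the bisection assumes that the sample R^2 decreases monotonically as the noise grows and that the upper bound `hi` already gives an R^2 below the target. The same standard normal draws are rescaled at every step so that the search is deterministic.

```python
import math
import random


def squared_correlation(u, v):
    """Sample squared Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv * suv / (suu * svv)


def y_noise_sd(var_f, r2):
    """Noise only in Y: sigma = sqrt(var(f(X)) * (1/R^2 - 1))."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    return math.sqrt(var_f * (1.0 / r2 - 1.0))


def x_noise_sd(f, xs, target_r2, hi=10.0, iters=30, seed=0):
    """Noise only in X: bisection on the noise SD using a large-sample R^2 estimate."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in xs]  # fixed draws, rescaled at each step
    fx = [f(x) for x in xs]

    def r2_at(sd):
        return squared_correlation([f(x + sd * e) for x, e in zip(xs, z)], fx)

    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if r2_at(mid) > target_r2:
            lo = mid  # too little noise: R^2 is still above the target
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Example with f(x) = x and X uniform on [0, 1], so var(f(X)) = 1/12.
_rng = random.Random(42)
xs = [_rng.random() for _ in range(20000)]
sd_y = y_noise_sd(1.0 / 12.0, 0.5)
sd_x = x_noise_sd(lambda t: t, xs, 0.5)
```

For the line, noise in X and noise in Y play symmetric roles, so both calls should return approximately sqrt(1/12), which gives a simple sanity check on the search.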

B Additional equitability results 


Sample Size   P_X(X)   Maximal Corr. (ACE)   I [L^2] (Kraskov, k=1)   I [L^2] (Kraskov, k=6) 

n = 250   Even Along f(X)      Y-Noise 
          Even Along X         Y-Noise 
          Even Along f(X)      XY-Noise 
          Even Along X         XY-Noise 
          Even Along f(X)      X-Noise 
          Even Along X         X-Noise 
          Uniform Along f(X)   Y-Noise 
          Uniform Along X      Y-Noise 
          Uniform Along f(X)   XY-Noise 
          Uniform Along X      XY-Noise 
          Uniform Along f(X)   X-Noise 
          Uniform Along X      X-Noise 
          Worst Case 


1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 

0.68 1.00 


0.58 

0.52 

0.63 

0.66 

0.64 

0.66 

0.56 

0.51 

0.61 

0.68 

0.61 

0.68 


1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 


1.00 1.00 

1.00 0.99 

1.00 1.00 

1.00 1.00 

1.00 1.00 

1.00 1.00 

1.00 0.65 

1.00 0.59 

1.00 0.69 

1.00 0.76 

1.00 0.70 

1.00 0.81 


1.00 1.00 

0.92 1.00 

1.00 1.00 

0.88 1.00 

1.00 1.00 

0.90 1.00 

1.00 1.00 

0.87 1.00 

1.00 1.00 

0.98 1.00 

1.00 1.00 

0.98 1.00 


1.00 

0.98 

1.00 

1.00 

1.00 

1.00 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 


1.00 1.00 


1.00 


1.00 


0.98 

0.99 

1.00 

0.99 

0.99 

0.99 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 

1.00 


0.88 

0.87 

0.94 

0.98 

0.94 

0.98 

0.83 

0.83 

0.94 

0.98 

0.94 

0.98 

0.98 



0.65 0.72 


Even Along f(X)      Y-Noise 
Even Along X         Y-Noise 
Even Along f(X)      XY-Noise 
Even Along X         XY-Noise 
Even Along f(X)      X-Noise 
Even Along X         X-Noise 
Uniform Along f(X)   Y-Noise 
Uniform Along X      Y-Noise 
Uniform Along f(X)   XY-Noise 
Uniform Along X      XY-Noise 
Uniform Along f(X)   X-Noise 
Uniform Along X      X-Noise 
Worst Case 


1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 

0.65 1.00 


0.53 

0.48 

0.59 

0.63 

0.58 

0.63 

0.52 

0.48 

0.59 

0.65 

0.59 

0.65 


n = 5000   Even Along f(X)      Y-Noise 
           Even Along X         Y-Noise 
           Even Along f(X)      XY-Noise 
           Even Along X         XY-Noise 
           Even Along f(X)      X-Noise 
           Even Along X         X-Noise 
           Uniform Along f(X)   Y-Noise 
           Uniform Along X      Y-Noise 
           Uniform Along f(X)   XY-Noise 
           Uniform Along X      XY-Noise 
           Uniform Along f(X)   X-Noise 
           Uniform Along X      X-Noise 
           Worst Case 


1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 

0.54 1.00 


0.44 

0.40 

0.48 

0.53 

0.48 

0.53 

0.44 

0.41 

0.49 

0.54 

0.49 

0.54 


1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 


1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 


0.45 

0.43 


0.97 

0.95 

0.97 

0.96 

0.45 

0.43 

0.57 

0.67 

0.59 

0.73 


0.96 

0.84 

0.95 

0.87 

0.69 

0.48 

0.83 

0.75 

0.84 

0.80 


1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 


0.18 

0.18 

0.41 

0.56 

0.77 

0.83 


0.17 

0.18 

0.37 

0.53 

0.41 

0.61 


0.08 

0.07 


0.32 

0.49 

0.37 

0.58 


0.09 

0.07 


0.30 

0.47 

0.34 

0.55 


1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 


0.98 

0.97 

0.98 

0.98 

0.98 

0.98 

0.96 

0.95 

0.98 

0.98 

0.98 

0.98 

0.98 


0.98 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 

0.98 



0.70 

0.70 


0.77 

0.95 


0.98 0.95 0.51 0.56 



Table B.1: A summary of the worst-case equitability of measures of dependence for a variety of noise models, 
independent-variable marginal distributions, and sample sizes. [Smaller values correspond to better equitability.] 
Each number is a worst-case interpretable interval length for a given statistic in a given setting. Therefore, smaller 
numbers indicate shorter interpretable intervals and more equitable behavior. Table cells are colored proportionally 
(red = interval of length 0; white = interval of length 1). The equitability of MIC e is relatively robust to factors like 
noise models, independent-variable marginal distributions, and sample size. Figures analogous to Figures 4 and B.1 
for all the settings presented in this table are included in the online supplementary materials. For statistics whose 
performance was dependent on parameter settings, we present for each sample size the best results across parameter 
values tested. Results are not presented for HHG for n = 5,000 as it was prohibitively computationally expensive to 
analyze at this sample size. 

Model 


Sample Size   P_X(X)   Maximal Corr. (ACE)   I [L^2] (Kraskov, k=1)   I [L^2] (Kraskov, k=6)   TIC e   MIC e   MIC 

Even Along f(X)      Y-Noise 
Even Along X         Y-Noise 
Even Along f(X)      XY-Noise 
Even Along X         XY-Noise 
Even Along f(X)      X-Noise 
Even Along X         X-Noise 
Uniform Along f(X)   Y-Noise 
Uniform Along X      Y-Noise 
Uniform Along f(X)   XY-Noise 
Uniform Along X      XY-Noise 
Uniform Along f(X)   X-Noise 
Uniform Along X      X-Noise 
Average Case 

Even Along f(X)      Y-Noise 
Even Along X         Y-Noise 
Even Along f(X)      XY-Noise 
Even Along X         XY-Noise 
Even Along f(X)      X-Noise 
Even Along X         X-Noise 
Uniform Along f(X)   Y-Noise 
Uniform Along X      Y-Noise 
Uniform Along f(X)   XY-Noise 
Uniform Along X      XY-Noise 
Uniform Along f(X)   X-Noise 
Uniform Along X      X-Noise 
Average Case 

Even Along f(X)      Y-Noise 
Even Along X         Y-Noise 
Even Along f(X)      XY-Noise 
Even Along X         XY-Noise 
Even Along f(X)      X-Noise 
Even Along X         X-Noise 
Uniform Along f(X)   Y-Noise 
Uniform Along X      Y-Noise 
Uniform Along f(X)   XY-Noise 
Uniform Along X      XY-Noise 
Uniform Along f(X)   X-Noise 
Uniform Along X      X-Noise 
Average Case 


0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 


0.50 
0.50 
0.50 
0.50 
0.50 
0.50 
0.50 
0.50 
0.50 
0.50 
0.50 
0.50 
0.50 


0.38 

0.31 

0.41 

0.41 

0.41 

0.41 

0.37 

0.32 

0.40 

0.42 

0.41 

0.42 

0.36 

0.29 

0.39 

0.40 

0.40 

0.40 

0.36 

0.30 

0.39 

0.41 

0.40 

0.41 


0.30 

0.23 

0.33 

0.34 

0.33 

0.35 

0.31 

0.23 

0.34 

0.35 

0.34 

0.35 


0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.48 

0.48 

0.48 

0.48 

0.48 

0.48 

0.48 

0.48 

0.48 

0.48 

0.48 

0.48 

0.48 


0.50 0.50 

0.50 0.50 

0.50 0.50 

0.50 0.50 

0.50 0.50 

0.50 0.50 

0.50 | 0.36 

0.50 0.33 

0.50 0.40 

0.50 0.43 

0.50 0.41 

0.50 0.44 


0.50 0.50 

0.49 0.50 

0.50 0.50 

0.48 0.50 

0.50 0.50 

0.49 0.50 

0.50 0.50 

0.48 0.50 

0.50 0.50 

0.49 0.50 

0.50 0.50 

0.49 0.50 


0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 


0.50 0.45 


0.49 


0.50 


0.50 



0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 


0.50 

0.49 

0.50 

0.49 

0.50 

0.49 

0.49 

0.49 

0.49 

0.49 

0.49 

0.49 

0.49 



0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 

0.50 


0.45 

0.45 

0.49 

0.49 

0.49 

0.49 

0.45 

0.44 

0.48 

0.48 

0.48 

0.48 


0.28 

0.26 

0.34 

0.40 

0.34 

0.40 

0.29 

0.28 

0.34 

0.41 

0.35 

0.42 


0.3 


0.32 

0.30 

0.37 

0.43 

0.37 

0.43 

0.32 

0.30 

0.37 

0.43 

0.37 

0.43 




Table B.2: A summary of the average-case equitability of measures of dependence for a variety of noise models, 
independent-variable marginal distributions, and sample sizes. [Smaller values correspond to better equitability.] Each 
number is an average interpretable interval length for a given statistic in a given setting. Therefore, smaller numbers 
indicate shorter interpretable intervals on average and more equitable behavior. Table cells are colored proportionally 
(red = interval of length 0; white = interval of length 1). The equitability of MIC e is relatively robust to factors like 
noise models, independent-variable marginal distributions, and sample size. Figures analogous to Figures 4 and B.1 
for all the settings presented in this table are included in the online supplementary materials. For statistics whose 
performance was dependent on parameter settings, we present for each sample size the best results across parameter 
values tested. Results are not presented for HHG for n = 5,000 as it was prohibitively computationally expensive to 
analyze at this sample size. 

[Figure B.1 panels: RDC, I [L^2] (Kraskov), HSIC, Max. Corr. (ACE), and Distance Corr., each shown at n = 250, 
n = 500, and n = 5000. Axis label: R^2[f(x), y]. The legend lists the 21 functions of Table A.1.] 



Figure B.1: The equitability of measures of dependence on a set Q of noisy functional relationships with an alternative
noise model and marginal distribution. [Narrower is more equitable.] The relationships take the form (X, f(X) + ε'),
where ε' is normally distributed with varying amplitude, and relationship strength is quantified by Φ = R^2. The plots
were constructed as described in Figure 2. In contrast to its poor equitability under the noise model used in Figure 4,
the Kraskov mutual information estimator, represented using the squared Linfoot correlation, is quite equitable under
this noise model at large sample sizes. At the low and mid-range sample sizes, MIC_e remains more equitable. For every
parametrized statistic whose parameter meaningfully affects equitability, results are presented at each sample size
using parameter settings that maximize equitability across all twelve of the noise/marginal distributions tested at
that sample size. See Tables B.1 and B.2 for a summary of the equitability of these measures of dependence under those
additional models, as well as the supplemental materials for the corresponding figures.
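
The rescaling of mutual information mentioned in this caption can be illustrated with a short sketch. It assumes the standard definition of the squared Linfoot correlation, L^2 = 1 - b^(-2I) for mutual information I measured in logarithms of base b; the function names are hypothetical, and clamping slightly negative estimates to zero is our own assumption rather than something the source states.

```python
import math

def squared_linfoot(mi, base=math.e):
    """Map a mutual information value to the squared Linfoot correlation.

    `mi` is in units of log `base` (nats by default, bits for base=2).
    Assumption: small negative estimates are clamped to zero so that the
    result always lies in [0, 1).
    """
    mi = max(mi, 0.0)
    return 1.0 - base ** (-2.0 * mi)

def gaussian_mi(rho):
    """Mutual information (in nats) of a bivariate normal with correlation rho."""
    return -0.5 * math.log(1.0 - rho ** 2)

# For a bivariate normal, the squared Linfoot correlation recovers rho^2,
# which is why it is comparable to R^2 on the horizontal scale.
linfoot_for_rho_06 = squared_linfoot(gaussian_mi(0.6))
```

Because the map is monotone, it changes the scale on which mutual information is displayed but not the ordering of relationships.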
[Figure B.2 (image not reproduced). Columns (noise model; marginal distribution p_X of X): (X + ε, f(X)), even along X;
(X + ε, f(X)), even along f(X); (X + ε, f(X) + ε'), even along X; (X + ε, f(X) + ε'), even along f(X);
(X, f(X) + ε'), even along X; (X, f(X) + ε'), even along f(X). Rows: MIC*, I [L^2]. Relationships plotted:
Cosine, High Freq; Cosine, Non-Fourier Freq [Low]; Cosine, Varying Freq [Medium]; Cubic; Cubic, Y-Stretched;
Exponential [10^x]; Exponential [2^x]; L-Shaped; Line; Linear+Periodic, High Freq; Linear+Periodic, High Freq 2;
Linear+Periodic, Low Freq; Linear+Periodic, Medium Freq; Lopsided L-Shaped; Parabola; Sigmoid; Sine, High Freq;
Sine, Low Freq; Sine, Non-Fourier Freq [Low]; Sine, Varying Freq [Medium]; Spike.]

Figure B.2: The equitability of MIC* and mutual information in the infinite data limit. [Narrower is more equitable.]
Six combinations of noise models and independent variable marginal distributions were analyzed. The values of MIC*
were computed using the newly introduced algorithm from Reshef et al. [2015a]. In each plot, the worst-case
interpretable interval is indicated by a red line, and both the worst- and average-case equitability are listed. Mutual
information values are represented in terms of the squared Linfoot correlation. In the large-sample limit, mutual
information is more equitable than MIC* in settings where there is noise only in the dependent variable, while MIC*
has superior equitability otherwise.
Figure B.3: The equitability of measures of dependence on noisy functional relationships with noise in the dependent
variable only, visualized in terms of power. [Redder is more equitable.] The set Q of noisy functional relationships
analyzed is the same as in Figure B.1, and relationship strength is again quantified by Φ = R^2. Plots were generated
as in Figure 6. In contrast to its performance under the noise model used in Figure 4, the Kraskov mutual information
estimator yields powerful tests under this noise model at large sample sizes. At the low and mid-range sample sizes,
tests based on MIC_e remain more powerful. For every parametrized statistic whose parameter meaningfully affects
equitability, results are presented at each sample size using parameter settings that maximize equitability across all
twelve of the noise/marginal distributions tested at that sample size. See Tables B.1 and B.2 for a summary of the
equitability of these measures of dependence under those additional models, as well as the supplemental materials for
the corresponding figures.
C Parameter sweeps for power against independence

[Figure C.1 (image not reproduced). Panels: MIC, MIC_e, TIC_e, Max. Corr. (ACE), I (Kraskov), HSIC, RDC. Annotation
recovered beside the Max. Corr. (ACE) panel: Ave_opt = 0.411, Min_opt = 0.130.]

Figure C.1: Power against independence as a function of the parameter of each measure of dependence. [Higher is
more powerful.] For each measure of dependence, we computed power curves over a range of parameters using the
relationships from Simon and Tibshirani [2012]. In order to aggregate the power of a given test across relationship
types, all power curves were computed as functions of the R^2 of the noisy relationship comprising the alternative
hypothesis, and the area under each power curve was computed. Here, we show for each statistic the area under
the power curve for each relationship type as a function of that statistic's parameter. The black line represents the
average area under the power curves across all relationship types, and the vertical dotted line represents the optimal
parameter setting. Both the average and worst-case performance across relationship types are listed for the optimal
parameter setting of each statistic. For the MIC statistic from Reshef et al. [2011], the red line represents the default
parameter setting, which was used by Simon and Tibshirani. This parameter setting turns out to be poor for testing
for independence; it is better suited for achieving equitability. For testing for independence, lower values of the
parameter are better suited, though these incur a cost in terms of equitability. (See Figure 9.)
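
The aggregation described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the dictionary layout, and the use of the trapezoidal rule for the area are all our own choices, and the toy power curves are invented solely to exercise the functions.

```python
import numpy as np

def area_under_power_curve(r2, power):
    """Trapezoidal area under one power curve sampled at increasing R^2 values."""
    r2 = np.asarray(r2, dtype=float)
    power = np.asarray(power, dtype=float)
    return float(np.sum(0.5 * (power[1:] + power[:-1]) * np.diff(r2)))

def best_parameter(r2, power_by_param):
    """Pick the parameter with the highest average area under the power curves.

    `power_by_param` maps a parameter value to a dict
    {relationship name: power values on the R^2 grid}.
    Returns (best parameter, average area, worst-case area).
    """
    summary = {}
    for param, curves in power_by_param.items():
        areas = [area_under_power_curve(r2, p) for p in curves.values()]
        summary[param] = (float(np.mean(areas)), float(np.min(areas)))
    best = max(summary, key=lambda p: summary[p][0])
    return best, summary[best][0], summary[best][1]

# Toy input (invented numbers): two parameter values, two relationship types.
toy_r2 = [0.0, 0.5, 1.0]
toy_power = {
    0.4: {"line": [0.05, 0.6, 1.0], "sine": [0.05, 0.2, 0.9]},
    0.6: {"line": [0.05, 0.5, 1.0], "sine": [0.05, 0.1, 0.8]},
}
toy_best, toy_avg, toy_worst = best_parameter(toy_r2, toy_power)
```

Reporting the worst-case area next to the average mirrors the two numbers listed per statistic at the optimal parameter setting.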
[Figure C.2 (image not reproduced). Surviving panel titles: MIC, I (Kraskov), HSIC, RDC.]

Figure C.2: Power against independence as a function of the parameter of each measure of dependence, with overall
power quantified differently than in Figure C.1. [Lower is more powerful.] As in Figure C.1, we compute power
curves for a range of parameters of each measure of dependence using the relationships from Simon and Tibshirani
[2012]. Here, in order to aggregate the power of a given test across relationship types, the power curve of each test
was computed as a function of the R^2 of the noisy relationship being tested, and the R^2 at which 50% power is
achieved for each relationship type was determined. This number is graphed for each relationship type and statistic
as a function of that statistic's parameter. The black line represents the average R^2 at which 50% power is achieved
across all relationships tested, and the vertical dotted line represents the optimal parameter setting. Both the average
and the worst-case performance across relationship types are listed for the optimal parameter setting of each statistic.
For the MIC statistic from Reshef et al. [2011], the red line represents the default parameter setting, which was used
by Simon and Tibshirani. This parameter setting turns out to be poor for testing for independence; it is better suited
for achieving equitability. For testing for independence, lower values of the parameter are better suited, though these
incur a cost in terms of equitability. (See Figure 9.)
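
The 50%-power criterion used here can be sketched in a few lines. The function name is hypothetical, and the sketch assumes that each power curve is sampled on an increasing grid of R^2 values and is roughly monotone, so that the first crossing of the 50% level is the quantity of interest; linear interpolation between grid points is likewise our assumption.

```python
import numpy as np

def r2_at_power(r2, power, level=0.5):
    """Smallest R^2 at which a sampled power curve first reaches `level`.

    Interpolates linearly between the last grid point below `level` and the
    first grid point at or above it. Returns None if the level is never reached.
    """
    r2 = np.asarray(r2, dtype=float)
    power = np.asarray(power, dtype=float)
    reached = np.nonzero(power >= level)[0]
    if reached.size == 0:
        return None
    i = int(reached[0])
    if i == 0:
        return float(r2[0])
    # power[i - 1] < level <= power[i], so the denominator is positive.
    frac = (level - power[i - 1]) / (power[i] - power[i - 1])
    return float(r2[i - 1] + frac * (r2[i] - r2[i - 1]))

# Toy curve (invented numbers) on a five-point R^2 grid.
toy_threshold = r2_at_power([0.0, 0.25, 0.5, 0.75, 1.0], [0.05, 0.2, 0.4, 0.8, 1.0])
```

Averaging this threshold over relationship types gives one number per parameter setting, with lower values indicating a more powerful test.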
D The equitability-runtime trade-off

[Figure D.1 (image not reproduced). Panels: n=250, n=500, n=5000. Axis label: Average Runtime (s).]

Figure D.1: The relationship between equitability and runtime of MIC_e. Sample sizes are n = 250 (left), 500
(middle), and 5,000 (right). Each plot shows, as α varies, the worst-case equitability of MIC_e with the given value of
α on the model used in Figure 9, graphed against the runtime of MIC_e with the same value of α. The multiple series
in every plot correspond to different values of c, with marker size indicating the size of c. The values of c used are 1,
2, 3, 5, 10, and 15. (c = 10 and c = 15 are omitted from the analysis for n = 5,000.) As α increases, we generally
see a rise in equitability but also in runtime.


E Parameter values used in analyses 

Parameter sweeps were performed for all methods in evaluating their equitability and statistical power against 
independence. 

Parameter values used in equitability analyses 

For each method, results are presented for the parameter values tested that maximized worst-case equitability
across all models Q examined, at each sample size (see Table E.1). Results for all parameter values tested,
including for some methods not included in the figures here due to space constraints, can be found in the
online supplement at http://www.exploredata.net/ftp/empirical_supplement.zip.

In the case of RDC and HSIC, the parameter values tested did not have a strong effect on equitability,
so we present performance for the default (rule-of-thumb) parameter values. That is, the random sampling
parameters (S_x, S_y) of RDC and the RBF kernel bandwidth parameters (σ_x, σ_y) used for HSIC were set
independently for each of the two samples being tested to the empirical median of the pairwise Euclidean
distances (the {0, 25, 50, 75, 100}th percentiles of the pairwise distances were also tested for these parameters).
For RDC, the number of random features was set to k = 10. For the Kraskov mutual information estimator, k = 1,
k = 6, k = 10, and k = 20 were tested. In the case of S^DDP, values of m > 3 were prohibitively computationally
expensive to run for this analysis. For MIC_e, at n = 250, 500, and 5,000, the ranges of α tested were
{0.60, 0.65, ..., 0.80, 0.85}, {0.25, 0.30, ..., 0.85, 0.90}, and {0.35, 0.40, ..., 0.70, 0.75}, respectively.
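
The percentile-of-pairwise-distances rule of thumb described above can be sketched as follows. The function name is hypothetical, and details such as excluding self-distances and using NumPy's default percentile interpolation are our assumptions; the source states only that the parameter was set to a percentile (by default the median) of the pairwise Euclidean distances within each sample.

```python
import numpy as np

def pairwise_distance_percentile(sample, q=50.0):
    """q-th percentile of the pairwise Euclidean distances within one sample.

    `sample` is a 1-D array of scalars or a 2-D array of shape (n, d).
    Each unordered pair is counted once and self-distances are excluded.
    The full n-by-n distance matrix is formed, so memory grows as n^2.
    """
    x = np.asarray(sample, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(x), k=1)
    return float(np.percentile(dists[upper], q))

# The parameter is chosen separately for each of the two variables being tested.
rng = np.random.default_rng(0)
x_sample = rng.normal(size=250)
y_sample = x_sample ** 2 + rng.normal(scale=0.5, size=250)
sigma_x = pairwise_distance_percentile(x_sample)        # median heuristic
sigma_y = pairwise_distance_percentile(y_sample, q=50)
```

Passing other values of q reproduces the alternative percentile settings that were swept.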


Sample size   MIC_e        TIC_e        S^DDP   I (Kraskov)   RDC                        HSIC
              α      c     α      c     m       k             S_x, S_y             k     σ_x, σ_y
250           0.75   15    0.80   3     2       6             Median pair. dist.   10    Median pair. dist.
500           0.80   5     0.80   3     2       6             Median pair. dist.   10    Median pair. dist.
5,000         0.65   3     0.70   3     2       6             Median pair. dist.   10    Median pair. dist.

Table E.1: Parameters used in the equitability analyses.

Parameter values used in statistical power analyses

Tables E.2 and E.3 summarize the optimal parameters identified for tests for independence based on the
methods examined, using area under the power curves and a 50% power threshold, respectively, as the
optimization criterion. The parameters in Table E.2 were used to generate the power curves in Figure 8.
The parameter ranges tested for each statistic can be observed from Figures C.1 and C.2.


Sample size   MIC_e        TIC_e        MIC          S^DDP   I (Kraskov)   RDC                        HSIC
              α      c     α      c     α      c     m       k             S_x, S_y             k     σ_x, σ_y
100           0.48   5     0.50   5     0.40   5     3       13            5%-ile pair. dist.   10    45%-ile pair. dist.
500           0.35   5     0.38   5     0.30   5     3       50            5%-ile pair. dist.   10    60%-ile pair. dist.

Table E.2: Best parameters for testing for independence, identified by maximizing the average area under the power
curves generated by a given test for the set of relationships examined.


Sample size   MIC_e        TIC_e        MIC          S^DDP   I (Kraskov)   RDC                        HSIC
              α      c     α      c     α      c     m       k             S_x, S_y             k     σ_x, σ_y
100           0.74   5     0.96   5     0.48   5     5       12            5%-ile pair. dist.   10    30%-ile pair. dist.
500           0.56   5     0.68   5     0.36   5     4       41            5%-ile pair. dist.   10    5%-ile pair. dist.

Table E.3: Best parameters for testing for independence, identified by minimizing the average across relationship
types of the minimal R^2 for which the power of a given test remained above 50%.


Parameter values used in runtime analyses 

For methods whose runtime did not strongly depend on parameter settings, default parameter values were
used. That is, the Kraskov mutual information estimator was run using k = 6, and the random sampling
parameters (S_x, S_y) of RDC and the RBF kernel bandwidth parameters (σ_x, σ_y) used for HSIC were set
independently for each of the two samples being tested to the empirical median of the pairwise Euclidean
distances. In the case of RDC, the number of random features was set to k = 10, as in the runtime analysis in
Lopez-Paz et al. [2013]. The parameters used for MIC_e are presented in Table E.4.


Sample size   Power        Fast equitability   Equitability
              α      c     α      c            α      c
50            0.54   5     0.75   3            0.85   5
100           0.48   5     0.70   2            0.80   5
500           0.36   5     0.65   1            0.80   5
1,000         0.32   5     0.60   1            0.75   4
5,000         0.26   5     0.50   1            0.65   1
10,000        0.24   5     0.45   1            0.60   1

Table E.4: Parameters used in the runtime analysis of MIC_e presented in Table 2.


For MIC_e, the three sample-size-dependent parameter settings optimize for maximal power against independence,
80% of optimal equitability (fast equitability), and 99% of optimal equitability, respectively. For sample sizes
for which results were not available, parameter values were estimated via interpolation/extrapolation using a
power curve. As pointed out in Section 6.2, these parameter settings depend on the set of relationships being
examined; for example, for relationship suites with less complex relationships than the ones examined
in the analyses here, lower values of α would perform well and be more computationally efficient.
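
The interpolation/extrapolation step can be illustrated with a short sketch. The source does not give the fitted form, so this assumes that "power curve" means a power law α(n) ≈ a · n^b, fitted by least squares in log-log space; the function names are hypothetical, and the input values are the "Power" column of Table E.4, some of which may themselves have been produced by such a fit.

```python
import numpy as np

def fit_power_law(n_values, alpha_values):
    """Least-squares fit of alpha = a * n**b on log-log axes; returns (a, b)."""
    log_n = np.log(np.asarray(n_values, dtype=float))
    log_alpha = np.log(np.asarray(alpha_values, dtype=float))
    b, log_a = np.polyfit(log_n, log_alpha, 1)  # slope first, then intercept
    return float(np.exp(log_a)), float(b)

def predict_alpha(n, a, b):
    """Evaluate the fitted power law at sample size n."""
    return a * float(n) ** b

# "Power" column of Table E.4: sample sizes and the corresponding alpha values.
sample_sizes = [50, 100, 500, 1000, 5000, 10000]
power_alphas = [0.54, 0.48, 0.36, 0.32, 0.26, 0.24]

a_hat, b_hat = fit_power_law(sample_sizes, power_alphas)
alpha_at_250 = predict_alpha(250, a_hat, b_hat)  # a sample size not in the table
```

A negative fitted exponent reflects the pattern in the table, where the power-optimal α shrinks as the sample size grows.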